CN113191493A - Convolutional neural network accelerator based on FPGA parallelism self-adaptation - Google Patents

Convolutional neural network accelerator based on FPGA parallelism self-adaptation

Info

Publication number
CN113191493A
Authority
CN
China
Prior art keywords
output
data
convolution
activation
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110461762.1A
Other languages
Chinese (zh)
Other versions
CN113191493B (en)
Inventor
袁海英
曾智勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110461762.1A priority Critical patent/CN113191493B/en
Publication of CN113191493A publication Critical patent/CN113191493A/en
Application granted granted Critical
Publication of CN113191493B publication Critical patent/CN113191493B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolutional neural network accelerator based on FPGA parallelism self-adaptation, comprising a read command generator, a data distributor, an operation cluster group, an addition tree group, an output cache group and an output arbiter. According to the structure of the convolutional layer, the accelerator configures its parallelism as either multiple or single output-activation parallelism. The data distributor analyzes the consistency of the data held in the on-chip caches and broadcasts repeated off-chip data to the corresponding caches simultaneously. The operation units in an operation cluster are responsible for the convolution operations of different input channels; according to the structure of the convolution operation, the operation clusters are configured to operate either on different input channels or on different output activations. The output cache group contains a data routing module: for operation clusters that compute the same output activation, the corresponding output caches are connected end to end; otherwise each cache outputs independently. The accelerator both perceives sparse activations, reducing the operation load, and configures the operation units flexibly, greatly improving the parallel operation efficiency of the FPGA.

Description

Convolutional neural network accelerator based on FPGA parallelism self-adaptation
Technical Field
The invention relates to the fields of digital integrated circuits, electronic information and deep learning, and in particular to a convolutional neural network accelerator based on FPGA parallelism self-adaptation.
Background
With the breakthroughs of deep learning in scientific research, daily life, production, national defense and the military, the convolutional neural network (CNN) model has achieved great success. The large difference in size between the convolutional layers of a CNN model is one of the factors limiting accelerator performance. Because the front layers of a CNN have very few input channels (layer 0 of the typical image-recognition network VGG-16 has only 3 input channels), an accelerator with high parallelism cannot reach its expected operation performance on these front layers. Sparsity-aware accelerators perform even worse in this respect because they require finer-grained control. If the operation of the matrix-type operation unit were more flexible, the operation parallelism of a sparsity-aware accelerator could be raised while its operation efficiency is preserved.
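To make the utilization problem concrete, the short sketch below (the array dimensions Tm and Tn are illustrative assumptions, not values from this disclosure) estimates how much of a fixed multiplier array stays busy when parallelism is tiled only over input channels:

```python
def input_channel_utilization(chin, tm, tn):
    """Rough fraction of a Tm x Tn multiplier array doing useful work when
    parallelism is tiled only over the input channels of one layer."""
    used = min(chin, tm * tn)      # multipliers that actually receive input channels
    return used / (tm * tn)

# Hypothetical array: Tm = 16 operation units x Tn = 8 MACs = 128 multipliers.
print(input_channel_utilization(chin=3, tm=16, tn=8))    # layer 0 of VGG-16: about 2.3%
print(input_channel_utilization(chin=256, tm=16, tn=8))  # a deep layer: 1.0 (fully used)
```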
Disclosure of Invention
To solve the problem that, in prior-art neural network accelerator schemes, the operation loop cannot be completely tiled onto the operation units because the expansion of the operation parallelism is inflexible, the invention provides a convolutional neural network accelerator based on FPGA parallelism self-adaptation. On the basis of sparse activation perception, the accelerator expands the dimensions of parallel tiling and can control the sizes of three degrees of parallelism online, so that the utilization rate of the operation units remains high even when the convolutional layer is small.
In order to achieve the technical purpose, the invention adopts the technical scheme that:
a convolutional neural network accelerator based on FPGA parallelism adaptation, comprising: the device comprises a read command generator, a data distributor, an operation cluster group, an addition tree group, an output cache group and an output arbiter.
The deep convolutional neural network accelerator based on FPGA parallelism self-adaptation operates as follows. The operating parameters of the accelerator are first written externally into the configuration registers, and these parameters are broadcast into each operation cluster. The read command generator generates read requests for activations and weights, which are transmitted over two independent data channels; an arbiter dynamically issues the read request commands to the external bus. Data in the off-chip memory are read into the data distributor through the external bus, distributed by the data distributor to the different operation clusters, and the operation clusters perform the convolution operations. According to the parallelism, the multi-output addition tree selects the addition isolation mode and the number of effective outputs; the number of effective outputs equals the output-activation parallelism. The data routing module sets different output-cache connection modes according to the parallelism; these connection modes ensure that each parallel output-activation operation is connected to one output cache. Finally, an output arbiter writes the different output caches back to the off-chip memory.
The output arbiter selects which output buffer data are output from, according to the request level of each output buffer: when an output buffer holds half a burst length of data it raises a low-level request, and when it holds a full burst length of data it raises a high-level request. When no high-level request is present, the arbiter outputs the data of a low-level request; when a high-level request is present, it outputs the data corresponding to that high-level request.
Optionally, the data distributor unicasts or broadcasts data according to the data consistency within the operation clusters. When different operation clusters operate on different output activations, the data distributor broadcasts the weights and the last (k - stride) × w_in activations to the different operation clusters simultaneously, where k is the convolution kernel size, stride is the stride, and w_in is the width of the input feature map. When different operation clusters operate on different input channels of the same output activation, the data distributor unicasts, transmitting the weight and activation data to the corresponding operation clusters in turn.
Optionally, the operation cluster group comprises Tp operation clusters composed of operation units and adders, and each operation unit contains an on-chip activation cache and weight cache, an address generator, a responder, a sparse activation perceptron, a non-0 cache and Tn multiply-accumulators. The multiply-accumulators compute the activations of Tn output channels, and the multiply-accumulate results of the different operation units within one operation cluster are added by an adder to form the output of that cluster. The operation cluster group contains Tm operation units, distributed evenly over the operation clusters, where Tm is an integer multiple of Tp. Tm, Tp and Tn are hardware configuration parameters; the configuration scheme requires Tm × Tn to be smaller than the number of DSPs in the FPGA, and Tp to be an integer power of 2 between a minimum of 2 and a maximum of Tm. The operation cluster group can operate in parallel the convolution operations of different output channels and different input channels (or different output activations) of the same convolutional layer. Let the number of input channels of the convolutional layer be Chin. When Chin ≥ Tm × Tn, each operation cluster computes the convolution of ceil(Chin/Tp) input channels, where ceil() is the ceiling function. When Chin < Tm × Tn, Tu is taken as the smallest integer satisfying Chin < 2^Tu × Tn, and each operation cluster performs the convolution corresponding to ceil(Hout/2^Tu) rows of output activations and ceil(Chin/Tn) input channels. Each operation cluster is responsible for the convolution of the same number of output channels. The activation cache of each operation unit in the cluster group has an output bit width of 16 × Tn and outputs Tn activations to the sparse activation perceptron. The sparse activation perceptron in the operation unit extracts the non-zero values among the Tn activations and their corresponding offset values; after these are held in the non-0 cache, the weight cache marks the exact non-0 activation position c + off from the current weight read position c and the offset value off output by the non-0 cache, performs weight addressing to fetch the weights of the Tn output channels, and then executes c = c + Tn.
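The following sketch restates the parallelism-selection rule of this paragraph in Python; the function name, the default values of Tm, Tp and Tn, and the example layer sizes are illustrative assumptions only.

```python
import math

def configure_cluster_group(chin, hout, Tm=16, Tp=4, Tn=8):
    """Apply the rule as stated: if Chin >= Tm*Tn, tile over input channels;
    otherwise pick Tu as the smallest integer with Chin < 2**Tu * Tn and
    tile over output-activation rows as well."""
    if chin >= Tm * Tn:
        return {"mode": "input-channel parallel",
                "channels_per_cluster": math.ceil(chin / Tp)}
    Tu = 0
    while chin >= (2 ** Tu) * Tn:
        Tu += 1
    return {"mode": "output-activation parallel",
            "Tu": Tu,
            "rows_per_cluster": math.ceil(hout / 2 ** Tu),
            "channels_per_cluster": math.ceil(chin / Tn)}

print(configure_cluster_group(chin=512, hout=14))  # deep layer: input-channel parallel
print(configure_cluster_group(chin=64, hout=56))   # shallow layer: rows split by 2**Tu
```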
Optionally, the addition tree group has a multi-node output function and is composed of Tn multi-segment addition trees. The input port of each adder stage is equipped with a first-in first-out memory that isolates the operations of the different adders. The input of each adder stage is connected to an operation cluster, and its input values are convolution partial sums from different input channels. The output value of each adder stage serves as an independent output, each corresponding to a different output activation; when Tu is not 0, these outputs are linked to different output-activation caches, otherwise only the output of the last stage is valid and the other stages are discarded.
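A behavioural sketch of the multi-node adder chain (plain Python with illustrative names; the FIFO isolation and the cluster grouping details are not modelled): each stage adds one cluster's partial sum, and either every stage output is forwarded (Tu not 0) or only the final sum is kept.

```python
def multi_node_adder_chain(cluster_partial_sums, Tu):
    """Fold the partial sums of the operation clusters stage by stage; return
    every stage output when Tu != 0, otherwise only the last (full) sum."""
    stage_outputs = []
    acc = 0
    for partial in cluster_partial_sums:
        acc += partial                  # one adder stage
        stage_outputs.append(acc)       # this stage's tap
    return stage_outputs if Tu != 0 else [stage_outputs[-1]]

print(multi_node_adder_chain([1, 2, 3, 4], Tu=0))  # [10]: single output activation
print(multi_node_adder_chain([1, 2, 3, 4], Tu=2))  # [1, 3, 6, 10]: one tap per stage
```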
Optionally, the storage depth of the output caches equals the burst length of the external bus, the output caches correspond one to one with the operation clusters, and their inputs are connected to the adders of the multi-segment addition trees. The output cache group contains a data routing module composed of Tm/Tp multiplexers. The inputs of each multiplexer are connected to the adders of the addition tree and to the output of the previous multiplexer, and its output is connected to the input of the next multiplexer; the data selection mode of the multiplexers is that the outputs and inputs of every 2^Tu multiplexers are connected end to end.
An FPGA-based single-operation multi-task data stream for the output activations of a convolution operation decomposes one convolution operation, along the row direction, into Tu independently running tasks; the data stream inside every task is identical, and the tasks flow in parallel through different operation clusters without interfering with each other. The data stream within a task is unrolled in the order output channel, input channel and convolution window of the convolution loop, and each task performs the parallel convolution of Tn output channels and Tm × Tu input channels.
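A reference sketch of this data stream in NumPy (stride 1, no padding; the splitting helper and tensor shapes are illustrative assumptions): the output rows are first split into Tu independent tasks, and inside each task the loops run over output channel, input channel and convolution window.

```python
import numpy as np

def convolution_as_row_tasks(ifmap, weights, Tu=2):
    """Reference convolution (stride 1, no padding) split along the output-row
    direction into Tu independent tasks, each unrolled in the order
    output channel -> input channel -> convolution window."""
    chin, h_in, w_in = ifmap.shape
    chout, _, k, _ = weights.shape
    h_out, w_out = h_in - k + 1, w_in - k + 1
    out = np.zeros((chout, h_out, w_out))
    row_tasks = np.array_split(np.arange(h_out), Tu)     # one block of output rows per task
    for rows in row_tasks:                                # tasks could run on different clusters
        for co in range(chout):                           # output-channel loop
            for ci in range(chin):                        # input-channel loop
                for y in rows:                            # convolution-window loops
                    for x in range(w_out):
                        out[co, y, x] += np.sum(ifmap[ci, y:y+k, x:x+k] * weights[co, ci])
    return out

ifmap = np.random.rand(3, 8, 8)        # 3 input channels, 8x8 feature map (illustrative)
weights = np.random.rand(4, 3, 3, 3)   # 4 output channels, 3x3 kernels
print(convolution_as_row_tasks(ifmap, weights, Tu=2).shape)   # (4, 6, 6)
```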
The technical scheme adopted by the invention has the advantages and beneficial effects that:
The scheme realizes a flexible parallelism-expansion architecture for a sparse-activation accelerator; the architecture expands the dimensions of parallel tiling, and the sizes of the three degrees of parallelism can be controlled online. When the number of input channels of a convolutional layer is insufficient, the accelerator uses the operation units originally assigned to input-channel expansion to expand the output-activation parallelism, which raises the utilization rate of the operation units and achieves a higher throughput. Effectively skipping operations on 0-valued activations reduces the operation load without affecting the operation result. The data distributor transmits repeated on-chip data by broadcast, which greatly reduces the reading of redundant off-chip data and lowers the accelerator's dependence on bandwidth.
Drawings
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a diagram illustrating a structure of an operation cluster;
FIG. 3 is a schematic diagram of a data routing architecture;
Detailed Description
As shown in FIG. 1, the present embodiment relates to a deep convolutional neural network accelerator based on FPGA parallelism self-adaptation, which comprises: a read command generator, a data distributor, an operation cluster group, an addition tree group, an output cache group and an output arbiter.
The deep convolutional neural network accelerator based on FPGA parallelism self-adaptation operates as follows. The operating parameters of the accelerator are first written externally into the configuration registers, and these parameters are broadcast into each operation cluster. The read command generator generates read requests for activations and weights, which are transmitted over two independent data channels; an arbiter dynamically issues the read request commands to the external bus. Data in the off-chip memory are read into the data distributor through the external bus, distributed by the data distributor to the different operation clusters, and the operation clusters perform the convolution operations. According to the parallelism, the multi-output addition tree selects the addition isolation mode and the number of effective outputs, which equals the output-activation parallelism. To make full use of the output caches, the data routing module sets different output-cache connection modes according to the parallelism; these connection modes ensure that each parallel output-activation operation is connected to one output cache. Finally, an arbiter writes the different output caches back to the off-chip memory, and, so that the parallel multi-output-activation operations can be recombined into a single operation, the arbiter configures the output addresses according to the physical positions of the operation clusters on which the different output-activation operations run.
The read command generator sends read requests to the external bus to address the activation and weight data stored in the off-chip memory. A read request covers the activations and weights of Tn input channels as one unit, and the read order is: for the feature map, width, then height, then input-channel depth; for the weights, width, then height, then input-channel depth, then output-channel depth.
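The read order can be sketched as simple index generators (illustrative function names; only the ordering is meaningful): each request covers a block of Tn input channels, with width varying fastest.

```python
def activation_read_order(w_in, h_in, chin, Tn):
    """Yield (channel_block, y, x): width varies fastest, then height,
    then input-channel depth, in blocks of Tn channels per request."""
    for cb in range(0, chin, Tn):          # slowest: block of Tn input channels
        for y in range(h_in):              # then height
            for x in range(w_in):          # fastest: width
                yield (cb, y, x)

def weight_read_order(k, chin, chout, Tn):
    """Yield (out_channel, channel_block, ky, kx): width fastest, then height,
    then input-channel depth, then output-channel depth."""
    for co in range(chout):
        for cb in range(0, chin, Tn):
            for ky in range(k):
                for kx in range(k):
                    yield (co, cb, ky, kx)

print(list(activation_read_order(w_in=2, h_in=2, chin=4, Tn=2))[:4])
print(list(weight_read_order(k=2, chin=4, chout=2, Tn=2))[:4])
```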
The data streams in the operation clusters are identical and flow in parallel without interfering with each other; each data stream is unrolled in the order output channel, input channel and convolution window of the convolution loop.
The operation cluster group comprises Tp operation clusters composed of operation units and adders, and each operation unit contains an on-chip activation cache and weight cache, an address generator, a responder, a sparse activation perceptron, a non-0 cache and Tn multiply-accumulators. The multiply-accumulators compute the activations of Tn output channels, and the multiply-accumulate results of the different operation units within one operation cluster are added by an adder to form the output of that cluster. An operation cluster can be flexibly configured to operate on input channels or on output activations. The operation cluster group contains Tm operation units, distributed evenly over the operation clusters, where Tm is an integer multiple of Tp. Tm, Tp and Tn are hardware configuration parameters and can be configured offline according to the operation resources of the FPGA and the target network.
The structure of an operation cluster is shown in FIG. 2. The cluster group can operate in parallel the convolution operations of different output channels and different input channels, or of different output activations, of the same convolutional layer. When the number of input channels Chin of the convolutional layer satisfies Chin ≥ Tm × Tn, each operation cluster computes the convolution of ceil(Chin/Tp) input channels, where ceil() is the ceiling function. When Chin < Tm × Tn, Tu is taken as the smallest integer satisfying Chin < 2^Tu × Tn, and each operation cluster performs the convolution corresponding to ceil(Hout/2^Tu) rows of output activations and ceil(Chin/Tn) input channels. Each operation cluster operates on the convolution of the same number of output channels. The output bit width of the activation cache in each operation unit is 16 × Tn, and Tn activations are output to the sparse activation perceptron. The sparse activation perceptron extracts the non-zero values among the Tn activations and their corresponding offset values; after the non-0 cache, the weight cache marks the exact non-0 activation position c + off from the current weight read position c and the offset value off output by the non-0 cache, and performs weight addressing to fetch the weights of the Tn output channels. In the multiply-accumulators, the weights of the Tn output channels and one non-0 activation undergo parallel multiply-accumulate operations, after which c = c + Tn is executed.
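A behavioural sketch of this sparse inner loop in NumPy (function and variable names are illustrative; the weight-cache layout is an assumption): non-zero activations and their offsets are extracted, each surviving activation is multiplied against the weights of Tn output channels addressed at c + off, and c advances by Tn per group.

```python
import numpy as np

def sparse_mac_group(activations, weight_cache, acc, c, Tn):
    """Process one group of Tn activations from the activation buffer.
    weight_cache[pos] is assumed to hold the weights of the Tn output
    channels that correspond to the activation at read position pos."""
    offsets = np.flatnonzero(activations)          # perceptron: positions of non-0 activations
    for off in offsets:
        w = weight_cache[c + off]                  # weight addressing at position c + off
        acc += activations[off] * w                # Tn output channels updated in parallel
    return acc, c + Tn                             # afterwards c = c + Tn

Tn = 4
acc = np.zeros(Tn)
weight_cache = np.random.rand(8, Tn)               # 2 groups x Tn positions (illustrative)
c = 0
acc, c = sparse_mac_group(np.array([0.0, 1.5, 0.0, 2.0]), weight_cache, acc, c, Tn)
acc, c = sparse_mac_group(np.array([0.5, 0.0, 0.0, 0.0]), weight_cache, acc, c, Tn)
print(acc, c)    # accumulated Tn partial sums, c == 8
```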
The data distributor unicasts or broadcasts data according to the data consistency within the operation clusters. When the operation cluster group operates the convolution of different output channels and different output activations of the same convolutional layer, Tu is taken as the smallest integer satisfying Chin < 2^Tu × Tn, where Chin is the number of input channels. Every group of 2^Tu operation clusters receives the same weight data broadcast by the data distributor, while the remaining operation clusters receive different weight data unicast by the data distributor. Within every group of 2^Tu operation clusters, the last (k - stride) × w_in activation data are received as a broadcast from the data distributor and the rest are unicast, where w_in is the input activation width, k is the convolution kernel size, and stride is the stride.
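The sketch below (illustrative names and parameter values) groups the clusters that share a broadcast and computes the (k - stride) × w_in boundary activations broadcast inside each group:

```python
def distribution_plan(num_clusters, Tu, k, stride, w_in):
    """Group clusters that share the same weight broadcast, and count the
    boundary activations ((k - stride) * w_in values reused by neighbouring
    output rows) that are broadcast inside each group; the rest are unicast."""
    group_size = 2 ** Tu
    groups = [list(range(g, g + group_size))
              for g in range(0, num_clusters, group_size)]
    halo = (k - stride) * w_in        # activations broadcast inside each group
    return groups, halo

groups, halo = distribution_plan(num_clusters=8, Tu=1, k=3, stride=1, w_in=28)
print(groups)   # [[0, 1], [2, 3], [4, 5], [6, 7]]: clusters sharing one weight broadcast
print(halo)     # 56 boundary activations broadcast per group; the rest are unicast
```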
The addition trees of the addition tree group have a multi-node output function: the value of each adder stage serves as an independent output, and these output values are different output activations. When Tu is not 0, they correspond to different output-activation caches; otherwise only the output of the last stage is valid and the other stages are discarded. Taking the 8-operation-cluster structure shown in FIG. 3 as an example, each output passes through a data routing structure composed of several multiplexers, where AL_a_b denotes the output of the b-th adder of the a-th addition tree and OB_c denotes the data held in the c-th output cache. Each multiplexer is responsible for the data routing of one output cache and sends the addition tree results to the corresponding output cache according to the value of Tu. Every multiplexer except the first also has an input port from the previous output cache. In this way, when Tu is small, several output caches can be re-linked into one larger cache.
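A wiring sketch of this routing structure (the naming follows FIG. 3's AL/OB convention, but the grouping parameter below is a stand-in for the 2^Tu factor, not an exact reproduction of the selection logic):

```python
def output_buffer_wiring(num_buffers, group_size):
    """Return, for each output buffer OB_c, the source its multiplexer selects:
    an adder-tree tap for the first buffer of each group, or the previous
    buffer for the rest, so each group of group_size buffers chains end to
    end into one deeper buffer. group_size stands in for the 2**Tu factor."""
    wiring = {}
    for c in range(num_buffers):
        if c % group_size == 0:
            wiring[f"OB_{c}"] = f"AL_tap_{c}"      # fresh output-activation stream
        else:
            wiring[f"OB_{c}"] = f"OB_{c - 1}"      # forward the previous buffer
    return wiring

print(output_buffer_wiring(num_buffers=8, group_size=1))  # 8 independent buffers
print(output_buffer_wiring(num_buffers=8, group_size=4))  # two chains of 4 buffers each
```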
The storage depth of the output buffers equals the burst length of the external bus. Each output buffer presents 2 request levels to the output arbiter: when it holds half a burst length of data it raises a low-level request, and when it holds a full burst length of data it raises a high-level request. When no high-level request is present, the arbiter outputs the data of a low-level request; when a high-level request is present, it outputs the data corresponding to that high-level request.
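A behavioural sketch of the two-level arbitration (illustrative names; ties are broken by buffer index here, which the disclosure does not specify):

```python
def select_buffer_to_drain(buffer_fill_levels, burst_length):
    """Pick which output buffer to write back next.
    High-level requests (a full burst of data) win over low-level requests
    (half a burst); returns a buffer index or None if nothing is ready."""
    high = [i for i, fill in enumerate(buffer_fill_levels) if fill >= burst_length]
    low = [i for i, fill in enumerate(buffer_fill_levels) if fill >= burst_length // 2]
    if high:
        return high[0]          # serve a high-level request first
    if low:
        return low[0]           # otherwise serve any low-level request
    return None

print(select_buffer_to_drain([3, 10, 16, 7], burst_length=16))  # -> 2 (full burst ready)
print(select_buffer_to_drain([3, 10, 5, 7], burst_length=16))   # -> 1 (half burst ready)
```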

Claims (6)

1. A convolutional neural network accelerator based on FPGA parallelism self-adaptation, characterized by comprising: a read command generator, a data distributor, an operation cluster group, an addition tree group, an output cache group and an output arbiter;
the deep convolutional neural network accelerator based on FPGA parallelism self-adaptation operates as follows: the operating parameters of the accelerator are first written externally into the configuration registers, and these parameters are broadcast into each operation cluster; the read command generator generates read requests for activations and weights, which are transmitted over two independent data channels; an arbiter dynamically issues the read request commands to the external bus; data in the off-chip memory are read into the data distributor through the external bus, distributed by the data distributor to the different operation clusters, and the operation clusters perform the convolution operations; according to the parallelism, the multi-output addition tree selects the addition isolation mode and the number of effective outputs, which equals the output-activation parallelism; the data routing module sets different output-cache connection modes according to the parallelism, and these connection modes ensure that each parallel output-activation operation is connected to one output cache; finally, an output arbiter writes the different output caches back to the off-chip memory;
the output arbiter selects which output buffer data are output from, according to the request level of each output buffer: when an output buffer holds half a burst length of data a low-level request is triggered, and when it holds a full burst length of data a high-level request is triggered; when no high-level request is present, the arbiter outputs the data of a low-level request; when a high-level request is present, it outputs the data corresponding to that high-level request.
2. The convolutional neural network accelerator based on FPGA parallelism self-adaptation according to claim 1, wherein the data distributor unicasts or broadcasts data according to the data consistency within the operation clusters; when different operation clusters operate on different output activations, the data distributor broadcasts the weights and the last (k - stride) × w_in activations to the different operation clusters simultaneously, where k is the convolution kernel size, stride is the stride, and w_in is the width of the input feature map; when different operation clusters operate on different input channels of the same output activation, the data distributor unicasts, transmitting the weight and activation data to the corresponding operation clusters in turn.
3. The convolutional neural network accelerator based on FPGA parallelism self-adaptation according to claim 1, wherein the operation cluster group comprises Tp operation clusters composed of operation units and adders, and each operation unit contains an on-chip activation cache and weight cache, an address generator, a responder, a sparse activation perceptron, a non-0 cache and Tn multiply-accumulators; the multiply-accumulators compute the activations of Tn output channels, and the multiply-accumulate results of the different operation units within one operation cluster are added by an adder to form the output of that cluster; the operation cluster group contains Tm operation units, distributed evenly over the operation clusters, where Tm is an integer multiple of Tp; Tm, Tp and Tn are hardware configuration parameters, and the configuration scheme requires Tm × Tn to be smaller than the number of DSPs in the FPGA and Tp to be an integer power of 2 between a minimum of 2 and a maximum of Tm; the operation cluster group can operate in parallel the convolution operations of different output channels and different input channels (or different output activations) of the same convolutional layer; letting the number of input channels of the convolutional layer be Chin, when Chin ≥ Tm × Tn, each operation cluster computes the convolution of ceil(Chin/Tp) input channels, where ceil() is the ceiling function; when Chin < Tm × Tn, Tu is taken as the smallest integer satisfying Chin < 2^Tu × Tn, and each operation cluster performs the convolution corresponding to ceil(Hout/2^Tu) rows of output activations and ceil(Chin/Tn) input channels; each operation cluster is responsible for the convolution of the same number of output channels; the activation cache of each operation unit in the cluster group has an output bit width of 16 × Tn and outputs Tn activations to the sparse activation perceptron; the sparse activation perceptron in the operation unit extracts the non-zero values among the Tn activations and their corresponding offset values; after these are held in the non-0 cache, the weight cache marks the exact non-0 activation position c + off from the current weight read position c and the offset value off output by the non-0 cache, performs weight addressing to fetch the weights of the Tn output channels, and then executes c = c + Tn.
4. The convolutional neural network accelerator based on FPGA parallelism self-adaptation according to claim 1, wherein the addition tree group has a multi-node output function and is composed of Tn multi-segment addition trees; the input port of each adder stage is equipped with a first-in first-out memory that isolates the operations of the different adders; the input of each adder stage is connected to an operation cluster, and its input values are convolution partial sums from different input channels; the output value of each adder stage serves as an independent output, each corresponding to a different output activation; when Tu is not 0, these outputs are linked to different output-activation caches, otherwise only the output of the last stage is valid and the other stages are discarded.
5. The convolutional neural network accelerator based on FPGA parallelism self-adaptation according to claim 1, wherein the storage depth of the output caches equals the burst length of the external bus and their inputs are connected to the adders of the multi-segment addition trees; the output cache group contains a data routing module composed of Tm/Tp multiplexers, whose inputs are connected to the adders of the addition tree and to the output of the previous multiplexer and whose output is connected to the input of the next multiplexer; the data selection mode of the multiplexers is that the outputs and inputs of every 2^Tu multiplexers are connected end to end.
6. An FPGA-based single-operation multi-task data stream, characterized in that, for the output activations of a convolution operation, one convolution operation is decomposed along the row direction into Tu independently running tasks; the data stream inside every task is identical, and the tasks flow in parallel through different operation clusters without interfering with each other; the data stream within a task is unrolled in the order output channel, input channel and convolution window of the convolution loop, and each task performs the parallel convolution of Tn output channels and Tm × Tu input channels.
CN202110461762.1A 2021-04-27 2021-04-27 Convolutional neural network accelerator based on FPGA parallelism self-adaption Active CN113191493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110461762.1A CN113191493B (en) 2021-04-27 2021-04-27 Convolutional neural network accelerator based on FPGA parallelism self-adaption

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110461762.1A CN113191493B (en) 2021-04-27 2021-04-27 Convolutional neural network accelerator based on FPGA parallelism self-adaption

Publications (2)

Publication Number Publication Date
CN113191493A true CN113191493A (en) 2021-07-30
CN113191493B CN113191493B (en) 2024-05-28

Family

ID=76979620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110461762.1A Active CN113191493B (en) 2021-04-27 2021-04-27 Convolutional neural network accelerator based on FPGA parallelism self-adaption

Country Status (1)

Country Link
CN (1) CN113191493B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819129A (en) * 2022-05-10 2022-07-29 福州大学 Convolution neural network hardware acceleration method of parallel computing unit

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN109948774A (en) * 2019-01-25 2019-06-28 中山大学 Neural network accelerator and its implementation based on network layer binding operation
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN112052941A (en) * 2020-09-10 2020-12-08 南京大学 Efficient storage and calculation system applied to CNN network convolution layer and operation method thereof
CN112418396A (en) * 2020-11-20 2021-02-26 北京工业大学 Sparse activation perception type neural network accelerator based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
余成宇; 李志远; 毛文宇; 鲁华祥: "Design and implementation of an efficient sparse convolutional neural network accelerator" (一种高效的稀疏卷积神经网络加速器的设计与实现), CAAI Transactions on Intelligent Systems (智能系统学报), no. 02, 31 December 2020 (2020-12-31) *
李永博; 王琴; 蒋剑飞: "Design of a sparse convolutional neural network accelerator" (稀疏卷积神经网络加速器设计), Microelectronics & Computer (微电子学与计算机), no. 06, 5 June 2020 (2020-06-05) *


Also Published As

Publication number Publication date
CN113191493B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN107689948B (en) Efficient data access management device applied to neural network hardware acceleration system
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN112149811A (en) Scheduling perception tensor distribution module
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN108647779B (en) Reconfigurable computing unit of low-bit-width convolutional neural network
CN113807509B (en) Neural network acceleration device, method and communication equipment
CN113128675B (en) Multiplication-free convolution scheduler based on impulse neural network and hardware implementation method thereof
WO2021089009A1 (en) Data stream reconstruction method and reconstructable data stream processor
CN112418396A (en) Sparse activation perception type neural network accelerator based on FPGA
CN113222133B (en) FPGA-based compressed LSTM accelerator and acceleration method
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
CN110580519A (en) Convolution operation structure and method thereof
CN113191493B (en) Convolutional neural network accelerator based on FPGA parallelism self-adaption
CN114611686A (en) Synapse delay implementation system and method based on programmable neural mimicry core
CN113792868B (en) Neural network computing module, method and communication equipment
CN113313244B (en) Near-storage neural network accelerator for addition network and acceleration method thereof
US7827023B2 (en) Method and apparatus for increasing the efficiency of an emulation engine
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN114004351A (en) Convolution neural network hardware acceleration platform
CN110046695B (en) Configurable high-parallelism pulse neuron array
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant