CN113805840B - Fast accumulator

Info

Publication number: CN113805840B
Application number: CN202111365222.XA
Authority: CN
Inventors: 王中风; 王美琪
Original assignee: Nanjing Fengxing Technology Co ltd
Current assignee: Nanjing Fengxing Technology Co ltd
Priority date: 2021-11-18
Filing date: 2021-11-18
Publication date: 2022-05-03
Anticipated expiration: 2041-11-18
Also published as: CN113805840A

Abstract

An embodiment of the present application provides a fast accumulator comprising a plurality of groups of sequentially connected addition modules. Each group of addition modules comprises at least one addition unit; the addition unit comprises an adder and a register, and the adder is connected in a loop with the input end and the output end of the register. Except for the first group and the last group, each group of addition modules comprises a first input bit, a second input bit, a first output bit and a second output bit; the second input bit is directly connected with the second output bit of the previous group of addition modules through a register. Inserting registers into the carry chain between groups significantly shortens the critical path and raises the upper frequency limit of the system.

Description

Fast accumulator

Technical Field

The present application relates to the field of digital signal processing technology, and more particularly, to a fast accumulator.

Background

To improve the throughput of a digital signal processing system, the critical path of the system is generally shortened by inserting pipeline stages, which raises the operating frequency and hence the throughput. Here the critical path is the logic path with the longest delay among all paths in the circuit that do not pass through a register. Pipelining means inserting registers on a feed-forward cut-set of the circuit, and it is an effective way to optimize the critical path of a loop-free circuit. (A cut-set is a set of edges whose removal disconnects the graph; in a feed-forward cut-set all edges point forward, that is, from the input toward the output.)

However, many digital signal processing systems include the accumulator shown in fig. 1. The iterative operation in the accumulator introduces a loop, and inserting a register directly into this loop may destroy the correctness of the computation. The path delay of the accumulator is therefore difficult to reduce, and it easily becomes the bottleneck when optimizing the critical path of the system.

Disclosure of Invention

In order to meet the critical path delay requirement of the system and reduce the path delay of the accumulator, the present application provides a fast accumulator through the following embodiments.

A first aspect of the present application provides a fast accumulator, comprising: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, the addition unit comprises an adder and a register, and the adder is connected in a loop with the input end and the output end of the register;

except for the first group and the last group, each group of the addition modules comprises a first input bit, a second input bit, a first output bit and a second output bit; the first input bit is used for receiving the corresponding bit of the data to be accumulated, and the second input bit is directly connected with the second output bit of the previous group of addition modules through a register;

the first group of the addition modules comprises a first input bit, a first output bit and a second output bit, and the last group of the addition modules comprises a first input bit, a second input bit and a first output bit;

for the first group of the addition modules, the first output bit is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the first output bit and the second output bit are both connected to the input end of a merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.

Optionally, if each group of the addition modules includes two or more addition units, the adders in all the addition units are connected in sequence.

Optionally, if each group of the addition modules includes two or more addition units, the merge adder further includes a zero-insertion input bit.

A second aspect of the present application provides a fast accumulator, comprising: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, the addition unit comprises a first adder, a second adder and a register which are sequentially connected, and the output end of the register is connected to the input end of the second adder;

except for the first group and the last group, each group of the addition modules comprises a first input bit, a second input bit, a third input bit, a first output bit, a second output bit and a third output bit;

the first input bit is the input end of the first adder in each addition unit and is used for receiving the corresponding bit of the data to be accumulated; the second input bit is the input end of the first adder in the lowest-bit addition unit, and the second input bit is directly connected with the second output bit of the previous group of addition modules through a register; the third input bit is the input end of the second adder in the lowest-bit addition unit, and the third input bit is directly connected to the first output bit of the previous group of addition modules; the first output bit is the output end of the first adder in the highest-bit addition unit, the second output bit is the output end of the second adder in the highest-bit addition unit, and the third output bit is the output end of the register in each addition unit;

the first group of the addition modules comprises a first input bit, a first output bit, a second output bit and a third output bit, and the last group of the addition modules comprises a first input bit, a second input bit, a third input bit and a third output bit;

for the first group of the addition modules, the third output bit is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the third output bit and the second output bit are both connected to the input end of a merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.

Optionally, if each group of the addition modules includes two or more addition units, the first adders in all the addition units are connected in sequence, and the second adders are connected in sequence.

Drawings

FIG. 1 is a schematic structural diagram of a conventional accumulator, which introduces a loop;

FIG. 2 is a schematic diagram of an accumulator and its front and rear input/output portions according to an embodiment of the present disclosure;

fig. 3 is a schematic structural diagram of a fast accumulator disclosed in the embodiment of the present application;

fig. 4 is a schematic structural diagram of another fast accumulator disclosed in the embodiment of the present application;

fig. 5 is a schematic structural diagram of another fast accumulator disclosed in the embodiment of the present application;

fig. 6 is a schematic structural diagram of another fast accumulator disclosed in the embodiment of the present application;

FIG. 7 is a diagram illustrating an example of a three-stage pipeline process for an 8-bit adder in the fast accumulator disclosed in the embodiments of the present application;

FIG. 8 is a schematic diagram of a low bit width adder employed in the fast accumulator disclosed in the embodiment of the present application;

fig. 9 is a schematic diagram of a set of input feature maps with a channel number C and a convolution kernel with the channel number C according to an embodiment of the present application.

Detailed Description

To facilitate understanding of the technical solution of the present application, some concepts related to the present application are described below.

In the digital signal processing system, the operations of the accumulator and its front and back input/output parts can be described as follows, referring to fig. 2:

The outputs from the rest of the system are first input to an adder (Σ), and the addition results fall into two categories: one-way temporary sums and two-way temporary sums. Referring to (a) in fig. 2, the one-way temporary sum is TS (temporary sum). Referring to (b) in fig. 2, the two-way temporary sums are $TS_1$ and $TS_2$ in the figure. Temporary sums with more than two ways can be compressed into two-way temporary sums by a 4to2 compressor array or the like, which reduces to the case shown in (b) in fig. 2. Existing 4to2 compressors can be used in the present application.

Assume that the accumulated output of the accumulator does not exceed N bits and that there are T sets of inputs in total. The T sets of inputs are accumulated over T consecutive cycles to obtain two partial sums. The first is the partial sum held by the sum-bit registers (i.e., the registers holding the value on the "sum line" of each bit adder): $S = s_{N-1}, \ldots, s_2, s_1, s_0$. The second is the partial sum held by the carry registers (i.e., the registers holding the values on the carry chain of each bit adder): C. The two partial sums are then added by a merge adder to obtain the final output. How the two partial sums are generated in the accumulator, and how they are added, differ according to the requirements of the system.

For a one-way temporary sum input TS, assume that its N bits are denoted $I_{N-1}, \ldots, I_2, I_1, I_0$. To meet the critical path delay requirement of the system and reduce the path delay of the accumulator, a first embodiment of the present application provides a fast accumulator.
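The compression of several temporary sums into two can be illustrated with a minimal Python sketch. It assumes one common construction, a 4to2 compressor built from two 3:2 carry-save stages operating on whole words; the function names `carry_save_add` and `compress_4to2` are hypothetical, and the application itself only states that existing 4to2 compressors can be used.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compression: return (sum_word, carry_word) such that
    x + y + z == sum_word + carry_word, with no carry propagation."""
    sum_word = x ^ y ^ z                              # bitwise sum without carries
    carry_word = ((x & y) | (x & z) | (y & z)) << 1   # majority bits, weighted by 2
    return sum_word, carry_word


def compress_4to2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Compress four non-negative operands into two temporary sums."""
    s1, c1 = carry_save_add(a, b, c)
    return carry_save_add(s1, c1, d)


# Four products from a parallelism-4 multiplier array (values are illustrative).
ts1, ts2 = compress_4to2(13, 7, 22, 5)
print(ts1 + ts2)  # equals 13 + 7 + 22 + 5
```

The two outputs play the role of the two-way temporary sums in (b) of fig. 2: their sum equals the sum of the four inputs, but no carry has been propagated across bit positions.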

Referring to fig. 3 and 4, a fast accumulator disclosed in the first embodiment of the present application includes: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, each addition unit comprises an adder and a register, and the adder is connected in a loop with the input end and the output end of the register.

Except for the first group and the last group, each group of the addition modules comprises a first input bit I, a second input bit II, a first output bit I' and a second output bit II'; the first input bit I is used for receiving the corresponding bit of the data to be accumulated, and the second input bit II is directly connected with the second output bit II' of the previous group of addition modules through a register.

The first group of the addition modules comprises a first input bit I, a first output bit I' and a second output bit II', and the last group of the addition modules comprises a first input bit I, a second input bit II and a first output bit I'.

For the first group of the addition modules, the first output bit I' is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the first output bit I' and the second output bit II are both connected to the input end of the merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.

In fig. 3 and 4, the labels for the input bits and output bits are marked in only one of the addition modules, but these labels apply to every addition module. In the figures, "+" indicates an adder, and "D" indicates a register.

If each group of the addition modules comprises two or more addition units, the adders in all the addition units are connected in sequence.

If each group of the addition modules comprises two or more addition units, the merge adder also comprises a zero-insertion input bit.

The number of add units included in each set of add modules is determined by the critical path requirements of the system.

In one implementation, if the critical path requirement of the system does not exceed 1 full adder delay, each group of addition modules includes only one addition unit, i.e., a register is inserted in the carry chain at every bit of the N-bit adder in the accumulator, as shown in fig. 3. After T rounds of accumulation, the registers hold a partial sum C consisting of all N-1 carry bits. C and the high N-1 bits of the partial sum S are added by the merge adder (usually specially designed to meet the overall critical path requirement of the accumulator, as described in detail below), and the addition result is concatenated with $s'_0$, which is the lowest sum bit $s_0$ output directly, to obtain the accumulator output $S' = s'_{N-1}, \ldots, s'_2, s'_1, s'_0$.
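The behaviour described for fig. 3 can be sketched as a word-level Python model. This is an illustrative sketch, not the circuit of the application: it assumes that every bit position has one full adder whose carry-out is registered and consumed by the next bit in the following cycle, and the names `accumulate_carry_save` and `merge` are hypothetical.

```python
def accumulate_carry_save(inputs, n_bits):
    """Accumulate with a register on every carry line (one addition unit per group).

    Returns the two partial sums (S, C) held in the sum and carry registers.
    """
    mask = (1 << n_bits) - 1
    s = 0  # sum-bit registers
    c = 0  # carry registers; bit i holds the carry into bit i, so bit 0 is always 0
    for ts in inputs:
        ts &= mask
        new_s = s ^ c ^ ts                                       # per-bit full adder sum
        new_c = (((s & c) | (s & ts) | (c & ts)) << 1) & mask    # per-bit carry, registered
        s, c = new_s, new_c
    return s, c


def merge(s, c, n_bits):
    """Merge adder: add C to the high N-1 bits of S and pass s_0 straight through."""
    high_mask = (1 << (n_bits - 1)) - 1
    high = ((s >> 1) + (c >> 1)) & high_mask
    return (high << 1) | (s & 1)


s, c = accumulate_carry_save([3, 200, 77, 255, 1], n_bits=8)
print(merge(s, c, n_bits=8))  # equals (3 + 200 + 77 + 255 + 1) mod 256
```

In each cycle no carry travels further than one bit, which is why the loop contains only a single full adder delay; the carry propagation is deferred to `merge`, which lies outside the loop.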

In another implementation, if the critical path requirement of the system does not exceed 2 full adder delays, each group of addition modules includes two addition units, i.e., the N-bit adder in the accumulator is divided into groups of two bits, and a register is inserted into the carry chain between groups, as shown in fig. 4. After T rounds of accumulation, each two-bit addition has cached only one carry, so 0 needs to be inserted before C and S are added, forming $C = c_{N-1}, \ldots, 0, c_4, 0, c_2$. C is then added to the high N-2 bits of S by the merge adder, and the resulting sum is concatenated with $s'_1, s'_0$, which are $s_1, s_0$ output directly, to obtain the final accumulator output $S'$.

If the critical path of the system is allowed to be longer, the same approach extends by analogy: more bits of the adder form a group, and a register is inserted on the carry line between the adders of adjacent groups. For the bit positions where no register is inserted, 0 is inserted at the corresponding bit of the partial sum C so that the final addition can be completed.
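The grouping described above can be sketched in Python with the group width as a parameter. This is a behavioural sketch under stated assumptions: `n_bits` is taken to be a multiple of `group_bits`, each group is modelled as a small ripple adder whose carry-out is registered, and the function name `accumulate_grouped` is hypothetical.

```python
def accumulate_grouped(inputs, n_bits, group_bits):
    """Accumulate with a carry register only between groups of `group_bits` bits.

    Returns (S, C), where C already has 0 inserted at every bit position
    that has no carry register.
    """
    assert n_bits % group_bits == 0, "sketch assumes equal-width groups"
    n_groups = n_bits // group_bits
    gmask = (1 << group_bits) - 1
    sums = [0] * n_groups     # sum registers of each group
    carries = [0] * n_groups  # carries[g]: registered carry into group g (carries[0] stays 0)
    for ts in inputs:
        new_carries = [0] * n_groups
        for g in range(n_groups):
            x = (ts >> (g * group_bits)) & gmask
            total = sums[g] + x + carries[g]   # ripple addition inside the group
            sums[g] = total & gmask
            if g + 1 < n_groups:
                new_carries[g + 1] = total >> group_bits  # 0 or 1, used next cycle
        carries = new_carries
    s = sum(v << (g * group_bits) for g, v in enumerate(sums))
    c = sum(bit << (g * group_bits) for g, bit in enumerate(carries))
    return s, c


s, c = accumulate_grouped([3, 200, 77, 255, 1], n_bits=8, group_bits=2)
print(bin(c))               # only bits 2, 4 and 6 can be set
print((s + c) & 0xFF)       # equals (3 + 200 + 77 + 255 + 1) mod 256
```

With `group_bits=1` the model corresponds to the one-unit-per-group case, and larger values trade a longer in-loop ripple path for fewer carry registers.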

For two-way temporary sum inputs $TS_1$ and $TS_2$, assume that their bits are denoted $a_{N-1}, \ldots, a_2, a_1, a_0$ and $b_{N-1}, \ldots, b_2, b_1, b_0$ respectively. To meet the critical path delay requirement of the system and reduce the path delay of the accumulator, a second embodiment of the present application provides a fast accumulator.

Referring to fig. 5 and 6, a fast accumulator disclosed in the second embodiment of the present application includes: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, the addition unit comprises a first adder, a second adder and a register which are sequentially connected, and the output end of the register is connected to the input end of the second adder. It should be noted that "first" and "second" in the first adder and the second adder are defined according to the transmission direction of the input data: the data is initially input into the first adder and then transmitted from the first adder to the second adder.

Except for the first group and the last group, each group of the addition modules comprises a first input bit I, a second input bit II, a third input bit III, a first output bit I', a second output bit II' and a third output bit III'.

The first input bit I is the input end of the first adder in each addition unit and is used for receiving the corresponding bit of the data to be accumulated; the second input bit II is the input end of the first adder in the lowest-bit addition unit, and the second input bit II is directly connected with the second output bit II' of the previous group of addition modules through a register; the third input bit III is the input end of the second adder in the lowest-bit addition unit, and the third input bit III is directly connected to the first output bit I' of the previous group of addition modules; the first output bit I' is the output end of the first adder in the highest-bit addition unit, the second output bit II' is the output end of the second adder in the highest-bit addition unit, and the third output bit III' is the output end of the register in each addition unit.

The first group of the addition modules comprises a first input bit I, a first output bit I', a second output bit II' and a third output bit III', and the last group of the addition modules comprises a first input bit I, a second input bit II, a third input bit III and a third output bit III'.

For the first group of the addition modules, the third output bit III' is connected directly to an output port of the fast accumulator; for the other groups of the addition modules, the third output bit III' and the second output bit II are both connected to the input end of the merge adder, and the output end of the merge adder is connected to the output port of the fast accumulator.

In fig. 5 and 6, the labels for the input bits and output bits are marked in only one of the addition modules, but these labels apply to every addition module. In the figures, "+" indicates an adder, and "D" indicates a register.

If each group of the addition modules comprises two or more addition units, the first adders in all the addition units are connected in sequence, and the second adders are connected in sequence.

In one implementation, if the critical path requirement of the system does not exceed 2 full adder delays, each group of addition modules includes only one addition unit. That is, at every bit of the two adders (the first adder and the second adder, which respectively complete the accumulation of the local result and the addition of the temporary sums), a register is inserted on the carry chain of the adder used for local accumulation (i.e., at the second input bit corresponding to the first adder), as shown in fig. 5. After T rounds of accumulation, the registers hold a partial sum C consisting of all N-1 carry bits. C and the high N-1 bits of S are added by the merge adder, and the addition result is concatenated with the directly output lowest sum bit to obtain the accumulator output.

If the critical path requirement of the system does not exceed 3 full adder delays, the same circuit as shown in fig. 5 meets the requirement with the minimum hardware cost.

If the critical path requirement of the system does not exceed 4 full adder delays, each group of addition modules includes two addition units, i.e., every 2 bits of addition form a group, and a register is inserted between groups on the carry chain of the adder used for accumulating the result of the current bit (i.e., at the second input bit corresponding to the first adder in the first addition unit), as shown in fig. 6. After T rounds of accumulation, each 2-bit addition has cached 1 carry, so 0 needs to be inserted at the bits without a cached carry before C and S are added; C is then added to the high N-2 bits of S by the merge adder. The resulting sum is concatenated with $s'_1, s'_0$, which are $s_1, s_0$ output directly, to obtain the final accumulator output $S'$.
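The two-way case with one addition unit per group can be sketched as follows. Because the figures are not reproduced here, the wiring is an assumption based on the textual description: at each bit the first adder sums $a_i$, $b_i$ and the registered carry, its carry-out goes directly to the second adder of the next bit, and the carry-out of the second adder is registered. The names `majority` and `accumulate_two_way` are hypothetical.

```python
def majority(x: int, y: int, z: int) -> int:
    """Bitwise carry-out of a row of full adders."""
    return (x & y) | (x & z) | (y & z)


def accumulate_two_way(pairs, n_bits):
    """Accumulate two temporary sums per cycle; return partial sums (S, C)."""
    mask = (1 << n_bits) - 1
    s = 0   # sum registers (third output bit of each unit)
    cr = 0  # carry registers between second adder carry-out and next first adder
    for a, b in pairs:
        a &= mask
        b &= mask
        t = a ^ b ^ cr                           # first adder sum, per bit
        u = (majority(a, b, cr) << 1) & mask     # first adder carry, direct to next bit
        new_s = t ^ s ^ u                        # second adder sum, stored in register
        v = (majority(t, s, u) << 1) & mask      # second adder carry, registered
        s, cr = new_s, v
    return s, cr


pairs = [(10, 20), (200, 100), (255, 255), (1, 2)]
s, c = accumulate_two_way(pairs, n_bits=8)
print((s + c) & 0xFF)  # equals the sum of all eight operands mod 256
```

In this model the longest in-loop path passes through one first adder and one second adder, which matches the stated bound of 2 full adder delays; the final `s + c` stands in for the merge adder.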

In fig. 3 to 6, the signal $s_i$ on the sum line of an adder and the signal $c_i$ on a carry line carry the same labels after passing through a register as the corresponding input signals. In practice, these corresponding signals are equal in value only with a time offset of one cycle.

In the fast accumulators disclosed in the first and second embodiments, the critical path of the adder that adds the partial sums C and S (the merge adder) may be optimized according to the following schemes to ensure that the critical path of the accumulator module as a whole meets the system clock requirement.

One scheme relies on the fact that, in the optimized accumulator, the merge adder for adding the partial sums C and S is located outside the loop and involves only a pure forward path. To fit the critical path delay constraint, multiple pipeline stages may be inserted into the merge adder as appropriate. For example, referring to fig. 7, for an 8-bit partial sum addition, a pipeline stage may be inserted after every 3 bits of the adder, and the adder between two adjacent pipeline stages may use a fast low bit width addition implementation chosen according to timing, area and other constraints.
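The staged addition of fig. 7 can be sketched functionally in Python. The sketch assumes that the 8 bits are split into chunks of 3, 3 and 2 bits and that each loop iteration stands for one pipeline stage; it models the arithmetic only, not the timing, and the function name `pipelined_add` is hypothetical.

```python
def pipelined_add(x: int, y: int, n_bits: int = 8, chunk_bits: int = 3):
    """Add x and y chunk by chunk, handing the carry from one stage to the next.

    Returns (result mod 2**n_bits, final carry, number of stages).
    """
    result = 0
    carry = 0   # would be held in a pipeline register between stages
    stages = 0
    for low in range(0, n_bits, chunk_bits):
        width = min(chunk_bits, n_bits - low)
        cmask = (1 << width) - 1
        total = ((x >> low) & cmask) + ((y >> low) & cmask) + carry
        result |= (total & cmask) << low
        carry = total >> width
        stages += 1
    return result, carry, stages


print(pipelined_add(0xB7, 0x5C))  # (19, 1, 3): 183 + 92 = 275 = 256 + 19
```

Each stage only needs a 3-bit (or narrower) adder, so the delay per stage is bounded by the chunk width rather than by the full word width.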

Alternatively, considering that the introduction of a new merge adder into the optimized accumulator will bring extra hardware consumption, for a circuit with strict area constraint, the hardware complexity can be reduced in the following two ways.

1) According to the fast accumulator structures of fig. 3 to 6, 0 is inserted into the partial sum C that is input to the merge adder, so the output logic of the merge adder can be re-derived accordingly, thereby simplifying the design.

2) The merge adder may use a low bit width adder, group the addends, and perform several iterative additions to produce the final addition result. Assuming the low bit width adder is 4 bits wide and the two partial sums to be added are 8 bits wide, the implementation is shown in fig. 8, where $T_{sel}$ denotes a selection signal that selects whether the upper four bits or the lower four bits of the addends are used, and $C_{out}$ denotes the carry output signal from the previous time step.
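Way 2) can be sketched in Python as a 4-bit adder that is reused over two iterations. The polarity of the selection signal (0 for the lower four bits, 1 for the upper four bits) and the function names `add4` and `iterative_add8` are assumptions made for illustration; fig. 8 is not reproduced here.

```python
def add4(x: int, y: int, carry_in: int) -> tuple[int, int]:
    """4-bit adder: return (4-bit sum, carry out)."""
    total = (x & 0xF) + (y & 0xF) + carry_in
    return total & 0xF, total >> 4


def iterative_add8(s: int, c: int) -> int:
    """Add two 8-bit partial sums with one 4-bit adder used twice."""
    result = 0
    c_out = 0  # carry register: carry output from the previous iteration
    for t_sel in (0, 1):          # assumed: 0 selects the lower nibble, 1 the upper
        shift = 4 * t_sel
        nibble, c_out = add4(s >> shift, c >> shift, c_out)
        result |= nibble << shift
    return result


print(hex(iterative_add8(0x9C, 0x7A)))  # 0x16, i.e. (0x9C + 0x7A) mod 256
```

The hardware saving comes from instantiating a single narrow adder; the cost is that the merge takes two iterations instead of one.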

The core of the present application is to introduce a new accumulator implementation into a digital signal processing system, which significantly shortens the critical path and raises the upper frequency limit of the system. The main key points are as follows:

First, after grouping, registers are inserted on the carry chain between groups. Together with the bit adders and the registers inserted on the sum lines, this divides the generation of the accumulator output into two steps. The first step, accumulation, generates two partial sums: $S = s_{N-1}, \ldots, s_2, s_1, s_0$ held by the sum-line registers, and C held by the carry registers. The second step adds these two partial sums with an adder for partial sum addition.

Second, based on the first key point, the insertion scheme of the carry chain registers can be adjusted according to the critical path requirement of the system, so that the critical path constraint of the accumulator module is met with the least additional hardware consumption.

Third, based on the first key point, pipeline stages can be inserted into the adder for partial sum addition as appropriate, and fast adders (such as carry-select adders and carry-lookahead adders) can be used between pipeline stages, so as to meet the critical path constraint of the accumulator module.

Based on these key points, the critical path bottleneck caused by the loop in the accumulator can be fundamentally resolved. For the pure forward paths outside the accumulator, the system frequency can be raised by conventional methods such as inserting pipeline stages.

With reference to fig. 9, a typical example is described below: the convolution of a set of input feature maps with the corresponding convolution kernels in a Convolutional Neural Network (CNN).

Fig. 9 shows a set of input feature maps with C channels being convolved with convolution kernels with C channels; the dashed arrows show the correspondence between the system modules and the modules in the design of the present application. In fig. 9, the multiplications are parallelized along the input channel dimension, and for simplicity the selected parallelism is 4. After a group of four pixel values (a pixel value is the value of one point on an input feature map) has been multiplied by the corresponding weights, the results pass through a 4to2 compressor array to produce two temporary sums, which are input to the improved accumulator for accumulation. With the common 3 × 3 convolution kernel configuration, one pixel value of the next set of feature maps is obtained after 9C/4 accumulations. With the accumulator input set to 32 bits, the accumulator was implemented according to the architectures of fig. 4 and fig. 5 above, and timing analysis was performed using the SMIC 55nm library; the specific results are shown in the following table:

TABLE 1 Critical Path comparison before and after accumulator optimization

Synthesis analysis shows that the path delay of a prior-art 32-bit accumulator deeply optimized by the EDA tool is 2.17 ns, which limits the maximum frequency of the whole CNN accelerator system to 460.8 MHz. With the fast accumulator disclosed in the present application, the path delay of this module can be reduced to 1.10 ns; combined with pipelining of the other modules of the system, the system frequency can be raised to at most 909.1 MHz, and the throughput can be increased to at most 1.97 times the original.
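The relation between the reported path delays and frequencies can be checked with a short calculation. The sketch assumes the usual relation that the maximum clock frequency is the reciprocal of the critical path delay; the function name `max_frequency_mhz` is hypothetical.

```python
def max_frequency_mhz(delay_ns: float) -> float:
    """Maximum clock frequency in MHz for a critical path delay given in ns."""
    return 1000.0 / delay_ns


baseline = max_frequency_mhz(2.17)   # accumulator before optimization
optimized = max_frequency_mhz(1.10)  # fast accumulator
speedup = optimized / baseline

print(f"{baseline:.1f} MHz, {optimized:.1f} MHz, {speedup:.2f}x")
# 460.8 MHz, 909.1 MHz, 1.97x
```

The ratio of the two frequencies equals the ratio of the two delays, 2.17 / 1.10, which is where the stated factor of 1.97 comes from.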

Claims

1. A fast accumulator, comprising: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, the addition unit comprises an adder and a register, and the adder is connected in a loop with the input end and the output end of the register;

the first group of the addition modules comprises a first input bit, a first output bit and a second output bit, and the last group of the addition modules comprises a first input bit, a second input bit and a first output bit;

2. The fast accumulator of claim 1, wherein if each set of the adding modules comprises two or more of the adding units, the adders in all the adding units are connected in sequence.

3. The fast accumulator of claim 1 or 2, wherein if each group of the addition modules comprises two or more of the addition units, the merge adder further comprises a zero-insertion input bit.

4. A fast accumulator, comprising: a plurality of groups of sequentially connected addition modules, wherein each group of addition modules comprises at least one addition unit, the addition unit comprises a first adder, a second adder and a register which are sequentially connected, and the output end of the register is connected to the input end of the second adder;

5. The fast accumulator of claim 4, wherein if each set of the adding modules comprises two or more adding units, the first adders and the second adders of all the adding units are connected in sequence.

6. The fast accumulator of claim 4 or 5, wherein if each group of the addition modules comprises two or more of the addition units, the merge adder further comprises a zero-insertion input bit.