CN114239819A - DSP-based hybrid bit width accelerator and fusion calculation method - Google Patents
- Publication number: CN114239819A (application CN202111605030.1A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
Abstract
The invention provides a mixed bit width accelerator based on a high bit width DSP, together with a fusion calculation method. The DSP serves as the main computing unit; the multipliers and the multiplicands are each spliced by shifting and by inserting different isolation bit widths, so that multiple groups of multiply-accumulate operations of arbitrary low bit width can be performed in parallel. The accelerator supports arbitrary multiplication parallelism, maximizing DSP computing performance; it supports multipliers and multiplicands of arbitrary bit width, and supports both the fixed-multiplier and non-fixed-multiplier cases, giving it strong universality and a wide application range.
Description
Technical Field
The invention relates to the design of hybrid bit width neural network accelerators, and in particular to a hybrid bit width accelerator based on a high bit width DSP and a fusion calculation method.
Background
The DSP mentioned in this invention refers to the DSP IP in an FPGA, i.e., the IP block that implements digital signal processing. Its advantages are high speed and real-time signal processing. A DSP slice contains a dedicated hardware multiplier and can rapidly implement a variety of digital signal processing algorithms. Taking the DSP48E1 IP of the Xilinx 7-series FPGA as an example, its internal structure mainly comprises a 25 × 18 multiplier and a 48-bit accumulator, so that multiplication, multiply-accumulate, bitwise logic, and similar operations can be performed without consuming general FPGA fabric resources; it is widely applied in fields such as graphics and image processing, speech processing, signal processing, and communications.
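As a rough behavioral sketch (not vendor code), the multiply-accumulate path of a DSP48E1-style slice described above can be modeled as follows; the class name and the two's-complement wrapping helper are illustrative:

```python
def wrap_signed(value, bits):
    """Wrap an integer into the two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value

class Dsp48e1Model:
    """Behavioral model: a 25 x 18 signed multiplier feeding a 48-bit accumulator."""
    A_BITS, B_BITS, ACC_BITS = 25, 18, 48

    def __init__(self):
        self.acc = 0

    def mac(self, a, b):
        """Multiply-accumulate: acc += a * b, with hardware-like width wrapping."""
        a = wrap_signed(a, self.A_BITS)
        b = wrap_signed(b, self.B_BITS)
        self.acc = wrap_signed(self.acc + a * b, self.ACC_BITS)
        return self.acc
```

A usage sketch: `m = Dsp48e1Model(); m.mac(3, 4); m.mac(-2, 5)` leaves `m.acc == 2`, the running sum of the two products.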
In order to improve the computation speed of neural networks and relieve parameter-storage pressure, quantization methods are used extensively in deep neural networks. Reducing the bit width of quantized network parameters improves computational efficiency, reduces storage overhead and energy consumption, and thus greatly improves hardware performance. Because the parameters of different network layers affect the loss function to different degrees, quantizing the whole network at a uniform precision is problematic: if the precision is too low, network performance drops sharply, and if it is too high, the advantages of quantization cannot be fully realized. Based on this consideration, mixed bit width quantization, which selects a different quantization precision for each layer of the network, needs to be explored to optimize the network quantization problem. Mixed bit width quantization is very common in memory computation: on one hand, lower-bit-width parameters are friendly to limited memory capacity; on the other hand, supporting mixed bit width quantization has in practice become a demand that upper-layer network models place on neural network accelerators, since reducing the quantization bit width is an effective way to increase operation speed. Mixed bit width quantization can significantly reduce the number of parameters while preserving accuracy.
Studies have shown that, while maintaining the same accuracy, floating-point calculation is not required in deep learning inference; for many applications such as image classification, INT8 or even lower fixed-point precision suffices to maintain acceptable inference accuracy. Against this background of reduced quantization precision, the fixed bit width of the DSP inevitably wastes computing resources. Moreover, because neural networks have short update and iteration cycles, change quickly, and differ widely in quantization parameters between networks, designing a dedicated low-bit-width DSP for each specific network is unrealistic. One solution is to implement multiple low-bit-width multiply-accumulate operations in parallel on one high-bit-width DSP. In this direction, Xilinx proposed a parallel method under INT8 quantization based on its DSP48E2, implementing two 8-bit multiplications in parallel on one 27 × 18 DSP and thereby achieving a 1.75× improvement in peak computing performance. Its problems are: 1) only an 8-bit quantization bit width is considered, whereas lower quantization bit widths (such as 4 bits, 2 bits, and 1 bit) are common in networks with mixed bit width quantization; 2) only the weight-sharing case is considered, i.e., the multiplier is fixed across the two parallel multiplications, so universality is limited; 3) constrained by the 8-bit assumption, no implementation of higher parallelism (i.e., more groups of multiplications at once) is proposed.
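The shared-multiplier packing idea behind the Xilinx scheme can be sketched in a few lines; this is a simplified unsigned variant (the actual DSP48E2 scheme handles signed INT8 operands, which needs correction logic not shown here), with the function name and 18-bit shift chosen for illustration. Two 8-bit products share one wide multiplication and are recovered from disjoint bit fields: since a2·w fits in 16 bits, an 18-bit shift leaves 2 guard bits and no carry into the upper product.

```python
def packed_two_mults(a1, a2, w, shift=18):
    """Compute a1*w and a2*w with ONE wide multiplication (unsigned 8-bit operands)."""
    assert 0 <= a1 < 256 and 0 <= a2 < 256 and 0 <= w < 256
    packed = (a1 << shift) + a2            # spliced multiplicand fits the 27-bit port
    product = packed * w                   # the single DSP multiplication
    high = product >> shift                # recovers a1*w
    low = product & ((1 << shift) - 1)     # recovers a2*w (16 bits + 2 guard bits)
    return high, low
```

For example, `packed_two_mults(200, 123, 77)` returns the pair `(200*77, 123*77)` from one multiplication.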
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a hybrid bit width accelerator based on a high bit width DSP, and a fusion calculation method, that support both the fixed-multiplier and non-fixed-multiplier cases, giving better universality; multiple groups of multiply-accumulate operations of arbitrary low bit width can be performed in parallel, maximizing DSP computing performance.
The invention is realized by the following technical scheme:
a mixed bit width accelerator based on DSP comprises a first signal input end, a second signal input end, a third signal input end, a fourth signal input end, a first shifting unit, a first adder, a first selector, a third selector, a second shifting unit, a second adder, a second selector, a fourth selector and a DSP;
the first signal input end and the second signal input end are respectively used for receiving a multiplicand, and the third signal input end and the fourth signal input end are respectively used for receiving a multiplier;
when the parallelism is 1, a multiplicand at the first signal input end directly enters the DSP through the first selector, a multiplier at the third signal input end directly enters the DSP through the second selector, and the multiplicand and the multiplier are subjected to multiplication accumulation operation in the DSP;
when the parallelism is more than 1, the multiplicand of the first signal input end is shifted in the first shifting unit, the multiplicand enters the first adder and is added with the multiplicand of the second signal input end after the shifting operation, the multiplicand obtained after the adding operation enters the third selector, the multiplicand enters the first adder through the third selector, and the multiplicand continues to be added with the multiplicand of the first signal input end after the shifting operation until all the multiplicands finish the adding operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the multiplier of the third signal input end is shifted in the second shifting unit, the shifted multiplier enters the second adder and is added with the multiplier of the fourth signal input end, the multiplier obtained after the addition enters the fourth selector, the multiplier enters the second adder through the selection of the fourth selector, and the multiplier continues to be added with the multiplier of the third signal input end after the shifting operation until all the multipliers finish the addition operation; after all multipliers finish addition operation, the high-bit-width multiplier obtained by the second adder enters the DSP through the selection of the second selector; and the high bit width multiplicand and the high bit width multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
Preferably, when the parallelism is greater than 1, the number of shifts for the shift operation is calculated as follows:
let the multiplicand bit width of the first signal input end be n1, the multiplicand bit width entering the first adder through the third selector be n2, the multiplier bit width of the third signal input end be n3, and the multiplier bit width entering the second adder through the fourth selector be n4. The numbers of shifts applied to n1 and n3 are x + n4 and y + n2, respectively, where x, y, n1, n2, n3, and n4 satisfy the following constraint:
x+y+2(n2+n4)≥max(n1+x+n4+n2+n4,n3+y+n2+n4+n2)+1 (1)。
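Constraint (1) can be checked mechanically. The following sketch (an assumed brute-force helper, not part of the patent) finds the (x, y) pair with the smallest total shift that satisfies it; it reproduces the minimal values used in the worked examples later in the text (x = y = 1 for four 2-bit operands, and x1 = y1 = 3 for the 4/2-bit first step of parallelism 3).

```python
def min_shift_pair(n1, n2, n3, n4, bound=32):
    """Smallest (x, y) by total x + y satisfying constraint (1)."""
    best = None
    for x in range(bound):
        for y in range(bound):
            lhs = x + y + 2 * (n2 + n4)
            rhs = max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1
            if lhs >= rhs and (best is None or x + y < best[0] + best[1]):
                best = (x, y)
    return best
```

For instance, `min_shift_pair(2, 2, 2, 2)` yields `(1, 1)` and `min_shift_pair(4, 2, 4, 2)` yields `(3, 3)`.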
Preferably, a redundant bit width is inserted before each multiplicand of the high bit width multiplicand obtained by the first adder.
Preferably, the number of bits of the redundant bit width is calculated as follows:
let the number of bits of the redundant bit width inserted before each multiplicand be q; then q is calculated according to formula (6),
where m is the highest bit width of the DSP, d is the parallelism, and mult1 and mult2 are the bit widths of the high bit width multiplicand obtained by the first adder and the high bit width multiplier obtained by the second adder, respectively.
Further, when the parallelism is 2, let the multiplicand bit width of the first signal input end be n1, the multiplicand bit width entering the first adder through the third selector be n2, the multiplier bit width of the third signal input end be n3, and the multiplier bit width entering the second adder through the fourth selector be n4;
setting the supportable maximum accumulation times of the redundant bit width to be num _ acc, wherein a calculation formula is shown as a formula (3), max _ w1 is the maximum bit width of a high-order partial product, a middle partial product and a low-order partial product, the calculation formula is shown as a formula (4), mp _ w is the bit width of the middle partial product, and the calculation formula is shown as a formula (5);
max_w1=max(n1+n3,n2+n4,mp_w) (4)
mp_w=max(n1+n4+x+n4+n2,n2+n3+y+n2+n4)+1-min(x+n2+n4,y+n2+n4) (5)
wherein x, y, n1, n2, n3, and n4 satisfy the following constraints:
x+y+2(n2+n4)≥max(n1+x+n4+n2+n4,n3+y+n2+n4+n2)+1 (1)。
further, when the parallelism is greater than 2, the multiple groups of multiplications are combined into two groups of multiplications, and then the maximum accumulation times are calculated by using the method of processing the parallelism to be 2.
Preferably, when the parallelism is greater than 1 and is a fixed multiplier, the multiplicand at the first signal input end is subjected to shift operation in the first shift unit, the multiplicand enters the first adder and is subjected to addition operation with the multiplicand at the second signal input end after the shift operation, the multiplicand obtained after the addition operation enters the third selector, the multiplicand is selected by the third selector to enter the first adder, and the addition operation is continued to be performed with the multiplicand at the first signal input end after the shift operation until all the multiplicands finish the addition operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the fixed multiplier of the third signal input end directly enters the DSP through the second selector; the high bit width multiplicand and the fixed multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
Further, the shift operation specifically consists of appending, after the multiplicand, a number of 0 bits equal to the multiplier bit width.
Further, a redundant bit width is inserted before each multiplicand in the high bit width multiplicand obtained by the first adder. Let the parallelism be d and the bit width of the fixed multiplier be n; the maximum number of accumulations num_acc supportable by the redundant bit width is calculated according to formula (9), where max_w2 represents the sum of the maximum multiplicand bit width and the multiplier bit width; q is the number of bits of the redundant bit width inserted before each multiplicand, calculated according to formula (10), where n_i represents the multiplicand bit width.
a mixed bit width fusion calculation method based on DSP comprises that based on the accelerator,
when the parallelism is 1, a multiplicand at the first signal input end directly enters the DSP through the first selector, a multiplier at the third signal input end directly enters the DSP through the second selector, and the multiplicand and the multiplier are subjected to multiplication accumulation operation in the DSP;
when the parallelism is more than 1, the multiplicand of the first signal input end is shifted in the first shifting unit, the multiplicand enters the first adder and is added with the multiplicand of the second signal input end after the shifting operation, the multiplicand obtained after the adding operation enters the third selector, the multiplicand enters the first adder through the third selector, and the multiplicand continues to be added with the multiplicand of the first signal input end after the shifting operation until all the multiplicands finish the adding operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the multiplier of the third signal input end is shifted in the second shifting unit, the shifted multiplier enters the second adder and is added with the multiplier of the fourth signal input end, the multiplier obtained after the addition enters the fourth selector, the multiplier enters the second adder through the selection of the fourth selector, and the multiplier continues to be added with the multiplier of the third signal input end after the shifting operation until all the multipliers finish the addition operation; after all multipliers finish addition operation, the high-bit-width multiplier obtained by the second adder enters the DSP through the selection of the second selector; and the high bit width multiplicand and the high bit width multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
Compared with the prior art, the invention has the following beneficial effects:
the invention uses DSP as the mixed bit width neural network accelerator of the main calculating unit, connects the multiplier and the multiplicand respectively, and shifts and inserts different isolation bit widths respectively, thereby realizing the multiply-accumulate operation of multiple groups of arbitrary low bit widths. The accelerator supports any multiplication parallelism, and maximizes the DSP computing performance; the multiplier and the multiplicand with any bit width are supported, and the fixed and unfixed conditions of the multiplier are supported, so that the universality is better, and the application range is wide.
Drawings
FIG. 1 illustrates the scheme in which one high bit width multiplication implements two low bit width multiplications simultaneously;
FIG. 2 illustrates the data format for one high bit width multiplication implementing two low bit width multiplications simultaneously;
FIG. 3 illustrates the scheme in which one high bit width multiplication implements three low bit width multiplications simultaneously; (a) the first step of the scheme; (b) the second step of the scheme;
FIG. 4 is the middle partial product bit width analysis;
FIG. 5 illustrates the insertion scheme for one high bit width multiplication implementing two multiplications with the same multiplier;
FIG. 6 illustrates the insertion scheme for one high bit width multiplication implementing three multiplications with the same multiplier;
FIG. 7 is the data format for one high bit width multiplication implementing three multiplications sharing a multiplier;
FIG. 8 is the PE allocation in network computation;
FIG. 9 is the PE internal structure;
FIG. 10 is the internal structure of the DSP.
Detailed Description
For a further understanding of the invention, reference will now be made to the following examples, which are provided to illustrate further features and advantages of the invention, and are not intended to limit the scope of the invention as set forth in the following claims.
First, the scheme for implementing two low bit width multiply-accumulate operations on one high bit width DSP is described, as shown in fig. 1. For the two groups of low bit width multiplications to be performed, the multiplicands have bit widths n1 and n2 and the multipliers have bit widths n3 and n4, respectively. The two multiplicands and the two multipliers are each spliced by shifting into a single high bit width operand, the shift amounts during splicing being x + n4 and y + n2, respectively, where x, y, n1, n2, n3, and n4 satisfy the following constraint:
x+y+2(n2+n4)≥max(n1+x+n4+n2+n4,n3+y+n2+n4+n2)+1 (1)
Under this constraint, the single high bit width multiplication in the right half of fig. 1 is computed, and the high (n1 + n3) bits and the low (n2 + n4) bits of the result are taken separately; the results of the two groups of low bit width multiplications in the left half of fig. 1 (i.e., n1 × n3 and n2 × n4) are thereby obtained simultaneously.
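The splice-multiply-split procedure can be sketched as follows, assuming the convention that each operand is shifted by its isolation amount plus the low-field width (x + n2 + n4 for the multiplicand and y + n2 + n4 for the multiplier), which keeps the two middle partial products clear of both extracted fields; the function name is illustrative and the operands are taken as unsigned.

```python
def fused_two_mults(a1, b1, a2, b2, n1, n2, n3, n4, x, y):
    """Pack two independent products a1*b1 (n1 x n3 bits) and a2*b2
    (n2 x n4 bits) into ONE wide multiplication, then split the result."""
    s1 = x + n2 + n4                         # multiplicand shift (assumed convention)
    s2 = y + n2 + n4                         # multiplier shift (assumed convention)
    product = ((a1 << s1) + a2) * ((b1 << s2) + b2)
    high = product >> (s1 + s2)              # a1*b1, above the middle partial products
    low = product & ((1 << (n2 + n4)) - 1)   # a2*b2, below the middle partial products
    return high, low
```

With n1 = n2 = n3 = n4 = 2 and the minimal shifts x = y = 1 from constraint (1), `fused_two_mults(3, 2, 3, 3, 2, 2, 2, 2, 1, 1)` returns `(6, 9)`, i.e., 3 × 2 and 3 × 3 from one multiplication.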
Considering the accumulation operations that follow the multiplication, extra bit widths must be inserted in front of each multiplicand in the data format of the right half of fig. 1 to prevent overflow. Taking the insertion of a 2-bit width as an example, the final data format is shown in fig. 2.
Let the maximum redundant bit width allocatable to each multiplicand be q; q is calculated according to formula (2), where m is the highest bit width of the DSP multiplier and d is the parallelism, i.e., the number of groups of multiply-accumulate operations one DSP performs simultaneously.
Let the maximum number of accumulations supportable by the redundant bit width be num_acc, calculated according to formula (3), where q is the maximum redundant bit width allocatable to each multiplicand, calculated according to formula (2); max_w1 is the maximum bit width among the high, middle, and low partial products, calculated according to formula (4), where mp_w is the bit width of the middle partial product, calculated according to formula (5).
max_w1=max(n1+n3,n2+n4,mp_w) (4)
mp_w=max(n1+n4+x+n4+n2,n2+n3+y+n2+n4)+1-min(x+n2+n4,y+n2+n4) (5)
Consider next the scenario where one DSP performs more (greater than 2) multiply-accumulate operations simultaneously, as shown in fig. 3. For the three groups of low bit width multiplications to be performed, the multiplicands have bit widths n1′, n2′, n3′ and the multipliers have bit widths n4′, n5′, n6′, respectively. The basic idea of the scheme is to combine the three groups of multiplications into two groups and then apply the two-group parallel scheme in sequence. As shown in fig. 3(a), first substitute n1 = n1′ + n2′, n3 = n4′ + n5′, n2 = n3′, and n4 = n6′ into formula (1) to obtain x1 and y1; in the second step, substitute n1 = n1′, n3 = n4′, n2 = n2′ + x1 + n6′ + n3′, and n4 = n5′ + y1 + n6′ + n3′ into formula (1) to obtain x2 and y2, as shown in fig. 3(b).
In the spliced multiplication, the multiplicand comprises 5 parts, and redundant bit widths must be reserved in front of each of the three actual multiplicands to prevent overflow during accumulation. The methods for calculating the redundant bit width and the maximum number of accumulations are obtained by analogy with the parallelism-2 scheme.
Let the number of bits of the maximum redundant bit width inserted be q; q is calculated according to formula (6), where m is the highest bit width of the DSP multiplier; d is the parallelism, i.e., the number of groups of multiply-accumulate operations one DSP performs simultaneously; and mult1 and mult2 are the bit widths of the two operands after shift splicing, namely
mult1 = n1′ + n2′ + n3′ + x1 + x2 + n3′ + n5′ + n6′ + y1 + n6′
mult2 = n4′ + n5′ + n6′ + y1 + y2 + n2′ + n3′ + n6′ + x1 + n3′.
The number of accumulations supportable by the redundant bit width is num_acc, calculated according to formula (3), where q is the maximum redundant bit width allocatable to each multiplicand, calculated according to formula (6); max_w1 is the maximum bit width among the high, middle, and low partial products, calculated according to formula (8), where mp_w is the bit width of the middle partial product. Since the parallelism is now 3, mp_w comprises two partial calculations. First, calculate the partial product of the gray part in fig. 4; the green part is two groups of multiplications, analyzed as the parallelism-2 case: substitute n1 = n2′, n2 = n3′, n3 = n5′, n4 = n6′, x = x1, and y = y1 into formula (5) to obtain mp_w1. Then analyze the partial products after the high bits are added, i.e., treat the green part of fig. 4 as one group of multiplication and n1′ × n4′ as another group, again analyzed as the parallelism-2 case: substitute n1 = n1′, n2 = n2′ + x1 + n6′ + n3′, n3 = n4′, n4 = n5′ + y1 + n3′ + n6′, x = x2, and y = y2 into formula (5) to obtain mp_w2. Then mp_w = max(mp_w1, mp_w2); substituting into formula (8) gives the maximum partial-product bit width max_w1 for parallelism 3, and substituting max_w1 into formula (7) gives the maximum number of accumulations.
max_w1=max(n1′+n4′,n2′+n5′,n3′+n6′,mp_w) (8)
For the realization of higher parallelism, first merge the data according to fig. 3 and analyze it as the parallelism-2 case to obtain the shift bit width for splicing each operand; then, by analogy with the redundant bit width and maximum-accumulation analysis for parallelism 3, merge the operands and reduce step by step to parallelism 2, calculating the number of redundant bits and the maximum number of accumulations for the corresponding higher parallelism.
Next, consider the case where one multiplier is shared across the groups of multiplications, corresponding to weight sharing in actual network computation. For two parallel groups of multiplications, the 0-value insertion method is shown in fig. 5: the two numbers of bit widths n1 and n2 are multiplied by the same multiplier of bit width n3, and only n3 zeros need to be inserted before n2 for isolation. For three parallel groups of multiplications, the 0-value insertion method is shown in fig. 6: the fixed multiplier has bit width n4, and n4 zeros must be inserted between each pair of the three multiplicands. Similarly, for more parallel groups, inserting a multiplier-bit-width of zeros between every two multiplicands achieves the isolation. Finally, a redundant bit width is inserted before each multiplicand to prevent accumulation overflow, giving the final data format shown in fig. 7, where a redundant bit width of 2 is taken as an example.
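A sketch of this shared-multiplier scheme under stated assumptions (unsigned fields, and a lane stride of multiplicand width + multiplier width + q redundant bits; all names are illustrative): d multiplicands are spliced with zero isolation and guard bits, multiplied by the shared weight on each step, and the d accumulated dot products are recovered at the end.

```python
def pack_multiplicands(values, n, n_w, q):
    """Splice d multiplicands of width n: n_w zero bits isolate the lanes and
    q redundant bits absorb accumulation carries."""
    stride = n + n_w + q
    packed = 0
    for j, v in enumerate(values):
        assert 0 <= v < (1 << n)
        packed |= v << (j * stride)
    return packed, stride

def accumulate_shared_weight(columns, weights, n, n_w, q):
    """columns[t] holds the d multiplicands at step t; weights[t] is the shared
    multiplier. Returns the d accumulated dot products, one per lane."""
    acc, stride, d = 0, n + n_w + q, len(columns[0])
    for vals, w in zip(columns, weights):
        packed, _ = pack_multiplicands(vals, n, n_w, q)
        acc += packed * w                       # one DSP multiply-accumulate per step
    mask = (1 << stride) - 1
    return [(acc >> (j * stride)) & mask for j in range(d)]
```

For example, with 4-bit multiplicands, a 4-bit weight, and q = 4, accumulating the columns `[[3, 5], [7, 2], [1, 15]]` against weights `[9, 11, 4]` returns the two dot products `[108, 127]`.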
In the fixed-multiplier case, let the parallelism be d and the bit width of the fixed multiplier in the parallel calculation be n; the maximum number of accumulations num_acc is calculated according to formula (9), where max_w2 represents the sum of the maximum multiplicand bit width and the multiplier bit width, and q is the maximum redundant bit width allocatable to each multiplicand, calculated according to formula (10), where n_i indicates the multiplicand bit width.
When the number of accumulations reaches the limit of the redundant bit width, additional DSPs must be introduced to continue accumulating; the number of additional DSPs introduced is d − 1, where d is the parallelism. From this the acceleration ratio r of the DSP can be calculated, as in formula (11).
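Formula (11) itself is not reproduced in this text (it appears as an image in the original patent). A ratio of the form r = d · num_acc / (num_acc + d − 1), i.e., d · num_acc useful multiply-accumulates amortized over num_acc packed operations plus d − 1 continuation DSPs, reproduces the figures quoted later in the text and is used here as a reconstruction:

```python
def acceleration_ratio(d, num_acc):
    """DSP speedup (reconstructed form): num_acc packed d-parallel MACs
    amortize the d-1 extra DSPs needed once the redundant bits saturate."""
    return d * num_acc / (num_acc + d - 1)
```

As a check against the worked examples: `acceleration_ratio(2, 528)` is approximately 1.996 and `acceleration_ratio(3, 32)` is approximately 2.824, matching the quoted non-fixed-multiplier results for 2-bit and 1-bit operands.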
The invention takes a Xilinx 7-series FPGA chip as the platform to implement the neural network accelerator. The computing array of the accelerator mainly comprises 8 × 9 PEs (processing elements); the computing function of each PE is realized with the DSP48E1 of the Xilinx 7-series chip, which provides a 25 × 18 multiplier and a 48-bit accumulator.
Fig. 8 illustrates the PE allocation during neural network computation, taking the implementation of one convolution layer as an example: the feature map is 13 × 13 with 80 channels, and there are 64 convolution kernels, each 3 × 3 with 80 channels. Taking the weight-sharing case as an example, each row of PEs computes one channel of a convolution kernel; each PE receives one element of the convolution kernel and, according to the parallelism d set for the DSP, d values from the feature map. Fig. 8 shows the case of parallelism 2: each PE receives two feature-map values simultaneously, equivalent to performing two convolutions at once. The data flow inside the PE is shown in fig. 9. The shared weight is input at port ① and passes through selector sel1 into DSP port B; one of the two feature-map values is input at port ② and the other at port ③; the value at port ② is shifted and added to the value at port ③ to obtain the splicing result, which passes through selector sel2 into DSP port A, after which the DSP performs the multiplication. The internal structure of the DSP is shown in fig. 10; the four input ports B, A, D, and C are 18, 30, 25, and 48 bits wide, respectively. For higher parallelism, the first splicing result is fed back through selector sel4 to one input of the adder, a new multiplicand is shifted in from port ③ to the other input, and the addition yields the second splicing result. For the non-fixed-multiplier case, high-parallelism multiplication is achieved simply by splicing the multipliers at ports ①, ②, and ③ of fig. 9 by the same method, and inputting the splicing results to DSP port B and port A through sel1 and sel2, respectively.
Based on the above parallel computing method, for the non-fixed-multiplier case, take multiplicand and multiplier bit widths of 2 as an example. When the parallelism is 2, n1 = n2 = n3 = n4 = 2 is substituted into formula (1), and the minimum x and y satisfying the condition are x = y = 1; substituting into formulas (2) to (5) gives q = 9, mp_w = 5, max_w1 = 5, and num_acc = 528, respectively. Substituting d = 2 and num_acc = 528 into formula (11) gives an acceleration ratio of 1.996. When the parallelism is 3, following the step of fig. 3(a), n1 = n3 = 4 and n2 = n4 = 2 are substituted into formula (1), and the minimum x and y satisfying the condition give x1 = y1 = 3; proceeding to the step of fig. 3(b), n4′ + n2′ + n3′ + n6′ + x1 + y1 + n5′ + n3′ + n6′ = 20 > 18, which exceeds the multiplier bit width of the DSP, so the DSP48E1 cannot support 2-bit multiplication with parallelism 3.
Taking multiplicand and multiplier bit widths of 1 as an example, when the parallelism is 3, following the step of fig. 3(a), n1 = n3 = 2 and n2 = n4 = 1 are substituted into formula (1), and the minimum x and y satisfying the condition give x1 = y1 = 1; proceeding to the step of fig. 3(b), n1 = n3 = 1 and n2 = n4 = 4 are substituted into formula (1), and the minimum x and y satisfying the condition give x2 = y2 = 3. At this point, however, n2′ + n3′ + n6′ + x1 and n5′ + n3′ + n6′ + y1 bits have already been inserted to isolate the low-order product and prevent it from affecting the high-order product, so x2 = y2 = 0 may be taken (that is, when the x and y calculated from formula (1) are negative, both may be set to 0). Substituting into formulas (6) to (8) gives q = 5, mp_w1 = 3, mp_w2 = 6, max_w1 = 6, and num_acc = 32, respectively. Substituting d = 3 and num_acc = 32 into formula (11) gives an acceleration ratio of 2.824.
For the case of a fixed multiplier: when the multiplier and multiplicand are both 4 bits, the supported parallelisms are 2 and 3, with acceleration ratios of 1.969 and 1.5 respectively; when both are 3 bits, the supported parallelisms are 2, 3 and 4, with acceleration ratios of 1.992, 2.4 and 1.6 respectively; when both are 2 bits, the supported parallelisms are 2, 3, 4 and 5, with acceleration ratios of 1.996, 2.833, 2.286 and 1.667 respectively; when both are 1 bit, the supported parallelisms are 2, 3, 4, 5, 6, 7 and 8, with acceleration ratios of 1.999, 2.931, 3.5, 3.571, 3, 1.75 and 1.778 respectively.
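As an illustrative aside (not part of the patent text), the 2-parallel, 2-bit case worked through above can be modeled in software: two 2-bit multiplicands and two 2-bit multipliers are packed into single wide words, one wide multiplication is performed, and both products are recovered from disjoint bit fields. The packing shift of 2w + 1 bits is an assumption chosen so the cross terms cannot reach the high product; it is not derived from the patent's formulas (1)–(5).

```python
def fused_mul_2x2(a0, a1, b0, b1, w=2):
    """Pack two independent w-bit by w-bit products into one wide multiply.

    Operands are spaced 2*w + 1 bits apart so that the two cross terms
    (a1*b0 + a0*b1) stay strictly below the field holding a1*b1.
    """
    s = 2 * w + 1                          # per-side packing shift (with guard)
    wide = ((a1 << s) | a0) * ((b1 << s) | b0)
    low = wide & ((1 << (2 * w)) - 1)      # a0*b0 from the bottom field
    high = wide >> (2 * s)                 # a1*b1 from the top field
    return low, high

# exhaustive check over all 2-bit operand combinations
for a0 in range(4):
    for a1 in range(4):
        for b0 in range(4):
            for b1 in range(4):
                assert fused_mul_2x2(a0, a1, b0, b1) == (a0 * b0, a1 * b1)
```

The 2*w + 1 spacing mirrors the role of the redundant bits in the description: the middle partial product is bounded away from the field that holds the high-order product.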
The present invention is compared with prior-art methods, as shown in Table 1.
TABLE 1 comparison of the present invention with existing methods
Method | Parallelism | Multiplier bit width | Multiplier restriction
---|---|---|---
Xilinx int8 | 2 | 8 | Fixed multiplier
The invention | Arbitrary | Arbitrary | Arbitrary or fixed multiplier
Claims (10)
1. A DSP-based hybrid bit width accelerator, characterized by comprising a first signal input end, a second signal input end, a third signal input end, a fourth signal input end, a first shifting unit, a first adder, a first selector, a third selector, a second shifting unit, a second adder, a second selector, a fourth selector and a DSP;
the first signal input end and the second signal input end are respectively used for receiving a multiplicand, and the third signal input end and the fourth signal input end are respectively used for receiving a multiplier;
when the parallelism is 1, a multiplicand at the first signal input end directly enters the DSP through the first selector, a multiplier at the third signal input end directly enters the DSP through the second selector, and the multiplicand and the multiplier are subjected to multiplication accumulation operation in the DSP;
when the parallelism is more than 1, the multiplicand of the first signal input end is shifted in the first shifting unit, the multiplicand enters the first adder and is added with the multiplicand of the second signal input end after the shifting operation, the multiplicand obtained after the adding operation enters the third selector, the multiplicand enters the first adder through the third selector, and the multiplicand continues to be added with the multiplicand of the first signal input end after the shifting operation until all the multiplicands finish the adding operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the multiplier of the third signal input end is shifted in the second shifting unit, the shifted multiplier enters the second adder and is added with the multiplier of the fourth signal input end, the multiplier obtained after the addition enters the fourth selector, the multiplier enters the second adder through the selection of the fourth selector, and the multiplier continues to be added with the multiplier of the third signal input end after the shifting operation until all the multipliers finish the addition operation; after all multipliers finish addition operation, the high-bit-width multiplier obtained by the second adder enters the DSP through the selection of the second selector; and the high bit width multiplicand and the high bit width multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
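A minimal software sketch of the shift-unit / adder / feedback-selector loop described in the claim above (an illustration under assumed field widths, not the claimed hardware): each new operand is shifted past the running sum and added back through the selector, producing one high-bit-width operand per side, after which a single multiplication stands in for the DSP's multiply-accumulate.

```python
def pack(operands, field):
    """Model the shift unit + adder + feedback selector of claim 1:
    successive operands are shifted and added into one wide word."""
    acc = 0
    for v in reversed(operands):   # last-listed operand lands in the top field
        acc = (acc << field) + v   # shift operation, then the adder
    return acc

# parallelism 2 with 2-bit operands; field = 5 is an assumed width that
# keeps the four partial products of the wide multiply in disjoint fields
a = pack([3, 2], field=5)   # high-bit-width multiplicand (a0=3, a1=2)
b = pack([1, 3], field=5)   # high-bit-width multiplier   (b0=1, b1=3)
p = a * b                   # one DSP-style multiplication
assert p & 0b1111 == 3 * 1  # bottom field holds a0*b0
assert p >> 10 == 2 * 3     # top field holds a1*b1
```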
2. The DSP-based hybrid bit width accelerator of claim 1, wherein when the parallelism is greater than 1, the shift number calculation method when performing the shift operation is:
setting the multiplicand bit width of the first signal input end as n1, the multiplicand bit width entering the first adder through the third selector as n2, the multiplier bit width of the third signal input end as n3, and the multiplier bit width entering the second adder through the fourth selector as n4; the numbers of shifts for n1 and n3 are x + n4 and y + n2 respectively, where x, y, n1, n2, n3 and n4 satisfy the following constraint:
x + y + 2(n2 + n4) ≥ max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1 (1).
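For concreteness, constraint (1) can be searched numerically. The sketch below is illustrative only; it assumes x = y, which holds in the symmetric worked example of the description, and it uses the inequality exactly as transcribed here (the machine translation of the original formula may be imperfect).

```python
def min_equal_shifts(n1, n2, n3, n4, limit=64):
    """Smallest nonnegative x = y satisfying constraint (1) as transcribed:
    x + y + 2*(n2 + n4) >= max(n1+x+n4+n2+n4, n3+y+n2+n4+n2) + 1."""
    for s in range(limit):
        lhs = 2 * s + 2 * (n2 + n4)
        rhs = max(n1 + s + n4 + n2 + n4, n3 + s + n2 + n4 + n2) + 1
        if lhs >= rhs:
            return s
    return None  # no solution within the search limit

# with all bit widths equal to 2, this reproduces x = y = 1 from the
# 2-bit, parallelism-2 worked example in the description
assert min_equal_shifts(2, 2, 2, 2) == 1
```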
3. The DSP-based hybrid bit width accelerator of claim 1, wherein a redundant bit width is inserted before each multiplicand in the high-bit-width multiplicand obtained by the first adder.
4. The DSP-based hybrid bit-width accelerator of claim 1, wherein the redundant bit-width bit number calculation method is:
if the number of bits of the redundant bit width inserted before each multiplicand is q, the calculation formula of q is as shown in formula (6)
Wherein m is the highest bit width of the DSP, d is the parallelism, and mult1 and mult2 are the bit widths of the high bit width multiplicand obtained by the first adder and the high bit width multiplier obtained by the second adder, respectively.
5. The DSP-based hybrid bit width accelerator of claim 4, wherein when the parallelism is 2, the multiplicand bit width at the first signal input end is set as n1, the multiplicand bit width entering the first adder via the third selector as n2, the multiplier bit width at the third signal input end as n3, and the multiplier bit width entering the second adder via the fourth selector as n4;
the maximum number of accumulations supportable by the redundant bit width is set as num_acc, whose calculation formula is shown in formula (3); max_w1 is the maximum bit width among the high-order, middle, and low-order partial products, calculated as in formula (4); mp_w is the bit width of the middle partial product, calculated as in formula (5);
max_w1 = max(n1 + n3, n2 + n4, mp_w) (4)
mp_w = max(n1 + n4 + x + n4 + n2, n2 + n3 + y + n2 + n4) + 1 − min(x + n2 + n4, y + n2 + n4) (5)
wherein x, y, n1, n2, n3, and n4 satisfy the following constraints:
x + y + 2(n2 + n4) ≥ max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1 (1).
6. The DSP-based hybrid bit width accelerator of claim 5, wherein when the parallelism is greater than 2, the multiple groups of multiplications are first combined into two groups, and the maximum number of accumulations is then calculated using the method for a parallelism of 2.
7. The DSP-based hybrid bit width accelerator of claim 1, wherein when the parallelism is greater than 1 and is a fixed multiplier, the multiplicand at the first signal input end performs a shift operation in the first shift unit, the multiplicand after the shift operation enters the first adder and performs an addition operation with the multiplicand at the second signal input end, the multiplicand obtained after the addition operation enters the third selector, and is selected by the third selector to enter the first adder, and the addition operation with the multiplicand at the first signal input end after the shift operation is continued until all the multiplicands complete the addition operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the fixed multiplier of the third signal input end directly enters the DSP through the second selector; the high bit width multiplicand and the fixed multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
8. The DSP-based hybrid bit width accelerator according to claim 7, wherein the shift operation is specifically: a number of 0 bits equal to the multiplier bit width is inserted after the multiplicand.
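The fixed-multiplier packing of claims 7–9 can be sketched as follows (a software illustration under stated assumptions, not the claimed circuit): each multiplicand is followed by the multiplier-bit-width zeros of claim 8 plus q redundant guard bits in the spirit of claim 9; here q = 1 is simply assumed, whereas the patent computes q from formula (10).

```python
def fixed_multiplier_products(multiplicands, b, w, n, q=1):
    """Pack w-bit multiplicands with n inserted zero bits (claim 8's shift)
    plus q redundant guard bits, multiply once by the fixed n-bit
    multiplier b, and slice each (w+n)-bit product back out."""
    stride = w + n + q                     # field size per multiplicand
    packed = 0
    for i, a in enumerate(multiplicands):
        packed |= a << (i * stride)        # zeros + guard separate the fields
    wide = packed * b                      # one DSP-style multiplication
    mask = (1 << (w + n)) - 1
    return [(wide >> (i * stride)) & mask for i in range(len(multiplicands))]

# three 4-bit multiplicands against one fixed 4-bit multiplier:
# the single wide multiply yields 5*11, 12*11, 7*11 in separate fields
assert fixed_multiplier_products([5, 12, 7], 11, w=4, n=4) == [55, 132, 77]
```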
9. The DSP-based hybrid bit width accelerator of claim 7, wherein a redundant bit width is inserted before each multiplicand in the high-bit-width multiplicand obtained by the first adder; the parallelism is set as d and the bit width of the fixed multiplier as n; the calculation formula of the maximum number of accumulations num_acc supportable by the redundant bit width is shown in formula (9), wherein max_w2 represents the sum of the maximum bit width among the multiplicands and the multiplier bit width; q is the number of bits of the redundant bit width inserted before each multiplicand, and its calculation formula is shown in formula (10), wherein ni represents the multiplicand bit width;
10. A DSP-based hybrid bit width fusion calculation method, characterized in that, based on the accelerator of claim 1:
when the parallelism is 1, a multiplicand at the first signal input end directly enters the DSP through the first selector, a multiplier at the third signal input end directly enters the DSP through the second selector, and the multiplicand and the multiplier are subjected to multiplication accumulation operation in the DSP;
when the parallelism is more than 1, the multiplicand of the first signal input end is shifted in the first shifting unit, the multiplicand enters the first adder and is added with the multiplicand of the second signal input end after the shifting operation, the multiplicand obtained after the adding operation enters the third selector, the multiplicand enters the first adder through the third selector, and the multiplicand continues to be added with the multiplicand of the first signal input end after the shifting operation until all the multiplicands finish the adding operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the multiplier of the third signal input end is shifted in the second shifting unit, the shifted multiplier enters the second adder and is added with the multiplier of the fourth signal input end, the multiplier obtained after the addition enters the fourth selector, the multiplier enters the second adder through the selection of the fourth selector, and the multiplier continues to be added with the multiplier of the third signal input end after the shifting operation until all the multipliers finish the addition operation; after all multipliers finish addition operation, the high-bit-width multiplier obtained by the second adder enters the DSP through the selection of the second selector; and the high bit width multiplicand and the high bit width multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111605030.1A CN114239819B (en) | 2021-12-24 | 2021-12-24 | Mixed bit width accelerator based on DSP and fusion calculation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114239819A true CN114239819A (en) | 2022-03-25 |
CN114239819B CN114239819B (en) | 2023-09-26 |
Family
ID=80762958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111605030.1A Active CN114239819B (en) | 2021-12-24 | 2021-12-24 | Mixed bit width accelerator based on DSP and fusion calculation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114239819B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000099313A (en) * | 1998-09-25 | 2000-04-07 | Denso Corp | Multiplier |
CN102591615A (en) * | 2012-01-16 | 2012-07-18 | 中国人民解放军国防科学技术大学 | Structured mixed bit-width multiplying method and structured mixed bit-width multiplying device |
CN111522528A (en) * | 2020-04-22 | 2020-08-11 | 厦门星宸科技有限公司 | Multiplier, multiplication method, operation chip, electronic device, and storage medium |
Non-Patent Citations (2)
Title |
---|
FAN DI; WANG JIAN; LAI JINMEI: "DSP blocks suitable for low-bit-width multiply-accumulate in FPGA", Journal of Fudan University (Natural Science Edition), no. 05 *
WANG NAN; HUANG ZHIHONG; YANG HAIGANG; DING JIAN: "An FPGA embedded DSP IP design supporting efficient addition", Journal of Terahertz Science and Electronic Information Technology, no. 05 *
Also Published As
Publication number | Publication date |
---|---|
CN114239819B (en) | 2023-09-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111667051B (en) | Neural network accelerator applicable to edge equipment and neural network acceleration calculation method | |
CN106909970B (en) | Approximate calculation-based binary weight convolution neural network hardware accelerator calculation device | |
CN110119809B (en) | Apparatus and method for performing MAC operations on asymmetrically quantized data in neural networks | |
CN108564168B (en) | Design method for neural network processor supporting multi-precision convolution | |
US20180197084A1 (en) | Convolutional neural network system having binary parameter and operation method thereof | |
Draisma et al. | A bootstrap-based method to achieve optimality in estimating the extreme-value index | |
CN107301456B (en) | Deep neural network multi-core acceleration implementation method based on vector processor | |
CN110705703B (en) | Sparse neural network processor based on systolic array | |
CN111898733B (en) | Deep separable convolutional neural network accelerator architecture | |
US10776078B1 (en) | Multimodal multiplier systems and methods | |
CN112434801B (en) | Convolution operation acceleration method for carrying out weight splitting according to bit precision | |
WO2023065983A1 (en) | Computing apparatus, neural network processing device, chip, and data processing method | |
US12039430B2 (en) | Electronic device and method for inference binary and ternary neural networks | |
CN108897716A (en) | By memory read/write operation come the data processing equipment and method of Reduction Computation amount | |
CN113052299B (en) | Neural network memory computing device based on lower communication bound and acceleration method | |
CN110766136B (en) | Compression method of sparse matrix and vector | |
CN114239819B (en) | Mixed bit width accelerator based on DSP and fusion calculation method | |
CN111626410B (en) | Sparse convolutional neural network accelerator and calculation method | |
CN111966327A (en) | Mixed precision space-time multiplexing multiplier based on NAS (network attached storage) search and control method thereof | |
CN108090865B (en) | Optical satellite remote sensing image on-orbit real-time streaming processing method and system | |
WO2023078364A1 (en) | Operation method and apparatus for matrix multiplication | |
CN114021070A (en) | Deep convolution calculation method and system based on micro-architecture processor | |
WO2019205064A1 (en) | Neural network acceleration apparatus and method | |
US20170068518A1 (en) | Apparatus and method for controlling operation | |
JP7238376B2 (en) | Information processing system and information processing system control method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||