CN114239819A - DSP-based hybrid bit width accelerator and fusion calculation method - Google Patents


Info

Publication number
CN114239819A
CN114239819A (application CN202111605030.1A, granted as CN114239819B)
Authority
CN
China
Prior art keywords
multiplicand
multiplier
dsp
width
signal input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111605030.1A
Other languages
Chinese (zh)
Other versions
CN114239819B (en)
Inventor
杨晨 (Yang Chen)
王佳兴 (Wang Jiaxing)
席嘉蔚 (Xi Jiawei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202111605030.1A
Publication of CN114239819A
Application granted
Publication of CN114239819B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52: Multiplying; Dividing
    • G06F7/523: Multiplying only

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a mixed bit width accelerator based on a high bit width DSP, together with a fusion calculation method. The DSP serves as the main calculation unit: the multipliers and the multiplicands are spliced separately, each shifted and padded with isolation bits of different widths, so that multiple groups of low bit width multiply-accumulate operations of arbitrary precision can be realized in one DSP. The accelerator supports arbitrary multiplication parallelism, maximizing DSP computing performance; it supports multipliers and multiplicands of arbitrary bit width, as well as both the fixed-multiplier and non-fixed-multiplier cases, giving it good generality and a wide range of applications.

Description

DSP-based hybrid bit width accelerator and fusion calculation method
Technical Field
The invention relates to the design of hybrid bit width neural network accelerators, and in particular to a hybrid bit width accelerator based on a high bit width DSP and a fusion calculation method.
Background
The DSP referred to in this invention is the DSP IP in an FPGA, i.e., an IP block implementing digital signal processing. Its advantages are high-speed, real-time signal processing. The DSP contains a dedicated hardware multiplier and can quickly implement a variety of digital signal processing algorithms. Taking the DSP48E1 IP of the Xilinx 7-series FPGA as an example, its internal structure mainly comprises a 25 × 18 multiplier and a 48-bit accumulator, so operations such as multiplication, multiply-accumulate, and bitwise logic can be realized without consuming FPGA fabric resources; it is widely applied in graphics and image processing, voice processing, signal processing, and other communication fields.
To improve the calculation speed of neural networks and relieve parameter storage pressure, quantization methods are used extensively in deep neural networks. Quantization reduces the bit width of network parameters, which improves calculation efficiency, reduces storage overhead and energy consumption, and greatly improves hardware performance. Because parameters in different network layers affect the loss function to different degrees, quantizing the whole network at a uniform precision is problematic: if the precision is too low, network performance degrades sharply, while if it is too high, the advantages of quantization cannot be fully exploited. For this reason, mixed bit width quantization is explored to optimize the network quantization problem: different layers of the network are quantized at different precisions. Mixed bit width quantization is particularly common in in-memory computing, since lower bit width parameters are friendly to limited memory capacity; moreover, supporting mixed bit width quantization has become a real demand that upper-layer network models place on neural network accelerators, because it is an effective way to increase operation speed while reducing quantization bit width. Mixed bit width quantization can significantly reduce the number of parameters while preserving accuracy.
Studies have shown that floating point calculation is not required in deep learning inference to maintain the same accuracy; for many applications such as image classification, INT8 or even lower fixed point precision suffices to keep inference accuracy acceptable. Against this background of reduced quantization precision, the fixed bit width of the DSP inevitably wastes computing resources. Moreover, because neural networks have short update cycles, change quickly, and differ widely in quantization parameters from network to network, designing a dedicated low bit width DSP for a specific network is unrealistic. One solution is to implement multiple low bit width multiply-accumulate operations in parallel on a high bit width DSP. In this direction, Xilinx proposed an int8 parallelization method based on its DSP48E2, implementing two 8-bit multiplications in parallel on a 27 × 18 DSP and achieving a 1.75× improvement in peak computing performance. However, this scheme has problems: 1) it considers only 8-bit quantization, whereas networks with mixed bit width quantization commonly use lower bit width parameters (such as 4 bits, 2 bits, or 1 bit); 2) it considers only the weight-sharing case, i.e., the multiplier is fixed across the two parallel multiplications, so its generality is limited; 3) constrained by its 8-bit assumption, it proposes no implementation of higher parallelism (i.e., more groups of multiplications at once).
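As a sanity check on the packing idea just described, the sketch below packs two unsigned multiplications that share one multiplier into a single wide multiply, in the spirit of the DSP48E2 scheme (the real scheme handles signed int8 operands via the DSP pre-adder and a correction step; the function name and the unsigned simplification here are illustrative assumptions):

```python
def pack2_shared_multiplier(a1, a2, w, shift=18):
    """Compute a1*w and a2*w with ONE wide multiplication.

    a1, a2: unsigned 8-bit multiplicands; w: unsigned 8-bit shared multiplier.
    a1 is placed 18 bits above a2, so each 16-bit product lands in its own field.
    """
    packed = (a1 << shift) + a2          # 27-bit-style spliced operand
    p = packed * w                       # one wide multiplication
    low = p & ((1 << shift) - 1)         # field holding a2 * w (16 bits < 18)
    high = p >> shift                    # field holding a1 * w
    return high, low
```

Since each 8 × 8 product needs at most 16 bits and the fields sit 18 bits apart, the two products can never overlap; this is the isolation argument the invention generalizes to arbitrary bit widths and parallelism.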
Disclosure of Invention
To solve the problems in the prior art, the invention provides a hybrid bit width accelerator based on a high bit width DSP and a fusion calculation method, which support both the fixed-multiplier and non-fixed-multiplier cases with good generality, realize multiple groups of low bit width multiply-accumulate operations of arbitrary precision in parallel, and maximize DSP computing performance.
The invention is realized by the following technical scheme:
a mixed bit width accelerator based on DSP comprises a first signal input end, a second signal input end, a third signal input end, a fourth signal input end, a first shifting unit, a first adder, a first selector, a third selector, a second shifting unit, a second adder, a second selector, a fourth selector and a DSP;
the first signal input end and the second signal input end are respectively used for receiving a multiplicand, and the third signal input end and the fourth signal input end are respectively used for receiving a multiplier;
when the parallelism is 1, a multiplicand at the first signal input end directly enters the DSP through the first selector, a multiplier at the third signal input end directly enters the DSP through the second selector, and the multiplicand and the multiplier are subjected to multiplication accumulation operation in the DSP;
when the parallelism is more than 1, the multiplicand of the first signal input end is shifted in the first shifting unit, the multiplicand enters the first adder and is added with the multiplicand of the second signal input end after the shifting operation, the multiplicand obtained after the adding operation enters the third selector, the multiplicand enters the first adder through the third selector, and the multiplicand continues to be added with the multiplicand of the first signal input end after the shifting operation until all the multiplicands finish the adding operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the multiplier of the third signal input end is shifted in the second shifting unit, the shifted multiplier enters the second adder and is added with the multiplier of the fourth signal input end, the multiplier obtained after the addition enters the fourth selector, the multiplier enters the second adder through the selection of the fourth selector, and the multiplier continues to be added with the multiplier of the third signal input end after the shifting operation until all the multipliers finish the addition operation; after all multipliers finish addition operation, the high-bit-width multiplier obtained by the second adder enters the DSP through the selection of the second selector; and the high bit width multiplicand and the high bit width multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
Preferably, when the parallelism is greater than 1, the shift number calculation method when performing the shift operation is:
setting the multiplicand bit width of the first signal input end as n1, the multiplicand bit width entering the first adder through the third selector as n2, the multiplier bit width of the third signal input end as n3, and the multiplier bit width entering the second adder through the fourth selector as n4; the numbers of shifts for n1 and n3 are x + n4 and y + n2, respectively, where x, y, n1, n2, n3, and n4 satisfy the following constraint:
x + y + 2(n2 + n4) ≥ max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1 (1)
Preferably, a redundant bit width is inserted before each multiplicand in the high bit width multiplicand obtained by the first adder.
Preferably, the method for calculating the bit number of the redundant bit width comprises the following steps:
if the number of bits of the redundant bit width inserted before each multiplicand is q, the calculation formula of q is as shown in formula (6)
q = floor((m − max(mult1, mult2)) / d) (6)
Wherein m is the highest bit width of the DSP, d is the parallelism, and mult1 and mult2 are the bit widths of the high bit width multiplicand obtained by the first adder and the high bit width multiplier obtained by the second adder, respectively.
Further, when the parallelism is 2, the multiplicand bit width of the first signal input end is set as n1, the multiplicand bit width entering the first adder through the third selector as n2, the multiplier bit width of the third signal input end as n3, and the multiplier bit width entering the second adder through the fourth selector as n4;
setting the maximum number of accumulations supportable by the redundant bit width as num_acc, calculated as in formula (3); max_w1 is the maximum bit width among the high, middle, and low partial products, calculated as in formula (4); mp_w is the bit width of the middle partial product, calculated as in formula (5);
num_acc = floor((2^(q + max_w1) − 1) / (2^max_w1 − 1)) (3)
max_w1=max(n1+n3,n2+n4,mp_w) (4)
mp_w=max(n1+n4+x+n4+n2,n2+n3+y+n2+n4)+1-min(x+n2+n4,y+n2+n4) (5)
wherein x, y, n1, n2, n3, and n4 satisfy the following constraints:
x + y + 2(n2 + n4) ≥ max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1 (1)
Further, when the parallelism is greater than 2, the multiple groups of multiplications are merged into two groups, and the maximum number of accumulations is then calculated using the parallelism-2 method.
Preferably, when the parallelism is greater than 1 and is a fixed multiplier, the multiplicand at the first signal input end is subjected to shift operation in the first shift unit, the multiplicand enters the first adder and is subjected to addition operation with the multiplicand at the second signal input end after the shift operation, the multiplicand obtained after the addition operation enters the third selector, the multiplicand is selected by the third selector to enter the first adder, and the addition operation is continued to be performed with the multiplicand at the first signal input end after the shift operation until all the multiplicands finish the addition operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the fixed multiplier of the third signal input end directly enters the DSP through the second selector; the high bit width multiplicand and the fixed multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
Further, the shift operation is specifically: zero bits equal in number to the multiplier bit width are inserted after each multiplicand.
Further, a redundant bit width is inserted before each multiplicand in the high bit width multiplicand obtained by the first adder. The parallelism is set as d and the bit width of the fixed multiplier as n; the maximum number of accumulations num_acc supportable by the redundant bit width is calculated as in formula (9), where max_w2 represents the sum of the maximum multiplicand bit width and the multiplier bit width; q is the number of bits of the redundant bit width inserted before each multiplicand, calculated as in formula (10), where ni represents the bit width of the i-th multiplicand;
num_acc = floor((2^(q + max_w2) − 1) / (2^max_w2 − 1)) (9)
q = floor((m − ((n1 + n2 + ... + nd) + (d − 1) × n)) / d) (10)
a mixed bit width fusion calculation method based on DSP comprises that based on the accelerator,
when the parallelism is 1, a multiplicand at the first signal input end directly enters the DSP through the first selector, a multiplier at the third signal input end directly enters the DSP through the second selector, and the multiplicand and the multiplier are subjected to multiplication accumulation operation in the DSP;
when the parallelism is more than 1, the multiplicand of the first signal input end is shifted in the first shifting unit, the multiplicand enters the first adder and is added with the multiplicand of the second signal input end after the shifting operation, the multiplicand obtained after the adding operation enters the third selector, the multiplicand enters the first adder through the third selector, and the multiplicand continues to be added with the multiplicand of the first signal input end after the shifting operation until all the multiplicands finish the adding operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the multiplier of the third signal input end is shifted in the second shifting unit, the shifted multiplier enters the second adder and is added with the multiplier of the fourth signal input end, the multiplier obtained after the addition enters the fourth selector, the multiplier enters the second adder through the selection of the fourth selector, and the multiplier continues to be added with the multiplier of the third signal input end after the shifting operation until all the multipliers finish the addition operation; after all multipliers finish addition operation, the high-bit-width multiplier obtained by the second adder enters the DSP through the selection of the second selector; and the high bit width multiplicand and the high bit width multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
Compared with the prior art, the invention has the following beneficial effects:
the invention uses DSP as the mixed bit width neural network accelerator of the main calculating unit, connects the multiplier and the multiplicand respectively, and shifts and inserts different isolation bit widths respectively, thereby realizing the multiply-accumulate operation of multiple groups of arbitrary low bit widths. The accelerator supports any multiplication parallelism, and maximizes the DSP computing performance; the multiplier and the multiplicand with any bit width are supported, and the fixed and unfixed conditions of the multiplier are supported, so that the universality is better, and the application range is wide.
Drawings
FIG. 1 shows the scheme by which one high bit width multiplication implements two low bit width multiplications simultaneously;
FIG. 2 shows the data format by which one high bit width multiplication implements two low bit width multiplications simultaneously;
FIG. 3 shows the scheme by which one high bit width multiplication implements three low bit width multiplications simultaneously; (a) the first step of the scheme; (b) the second step of the scheme;
FIG. 4 shows the bit width analysis of the middle partial product;
FIG. 5 shows the zero-insertion scheme for two multiplications sharing the same multiplier;
FIG. 6 shows the zero-insertion scheme for three multiplications sharing the same multiplier;
FIG. 7 shows the data format for three multiplicands multiplied by a shared multiplier in one high bit width multiplication;
FIG. 8 shows the PE allocation in a network computation;
FIG. 9 shows the PE internal structure;
FIG. 10 shows the internal structure of the DSP.
Detailed Description
For a further understanding of the invention, reference will now be made to the following examples, which are provided to illustrate further features and advantages of the invention, and are not intended to limit the scope of the invention as set forth in the following claims.
First, the scheme by which a high bit width DSP implements two low bit width multiply-accumulate operations is described, as shown in fig. 1. For the two groups of low bit width multiplications to be performed, the multiplicand and multiplier bit widths are n1 and n3 in the first group and n2 and n4 in the second. The two multiplicands are spliced by shifting into one high bit width operand and the two multipliers into another, the shift amounts being x + n4 and y + n2, respectively, where x, y, n1, n2, n3, n4 satisfy the following constraint:
x+y+2(n2+n4)≥max(n1+x+n4+n2+n4,n3+y+n2+n4+n2)+1 (1)
Under this constraint, the multiplication of the two high bit width operands in the right half of fig. 1 is computed, and taking the high (n1 + n3) bits and the low (n2 + n4) bits of the result yields, simultaneously, the two groups of low bit width products in the left half of fig. 1 (i.e., n1 × n3 and n2 × n4).
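A minimal pure-Python sketch of this splicing for two independent products (the function name and software modeling are ours; in hardware the wide multiply is the DSP): each operand places its high-group value above its low-group value with x + n4 (resp. y + n2) zero bits in between, and constraint (1) keeps the middle partial products from reaching the high field:

```python
def splice_multiply(a, b, c, d, n1, n2, n3, n4, x, y):
    """One wide multiply yielding a*b (widths n1 x n3) and c*d (widths n2 x n4).

    Assumes x, y satisfy constraint (1):
    x + y + 2*(n2 + n4) >= max(n1+x+n4+n2+n4, n3+y+n2+n4+n2) + 1
    """
    A = (a << (n2 + x + n4)) + c         # multiplicand: a | x+n4 zeros | c
    B = (b << (n4 + y + n2)) + d         # multiplier:   b | y+n2 zeros | d
    P = A * B                            # the single high bit width multiply
    low = P & ((1 << (n2 + n4)) - 1)     # low n2+n4 bits: c*d
    high = P >> (x + y + 2 * (n2 + n4))  # bits above the middle products: a*b
    return high, low
```

For the 2-bit example worked later in the text (n1 = n2 = n3 = n4 = 2, x = y = 1), this recovers both products for every combination of 2-bit inputs.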
To prevent overflow during the accumulation that follows the multiplication, extra bit widths need to be inserted in front of each multiplicand field in the data format of the right half of fig. 1. Taking the insertion of 2 bits as an example, the final data format is shown in fig. 2.
Let the maximum redundant bit width that can be allocated to each multiplicand be q; then q is calculated according to formula (2), where m is the highest bit width of the DSP multiplier and d is the parallelism, i.e., how many groups of multiply-accumulate one DSP performs simultaneously.
q = floor((m − max(n1 + x + n4 + n2, n3 + y + n2 + n4)) / d) (2)
Let the maximum number of accumulations supportable by the redundant bit width be num_acc, calculated as in formula (3), where q is the maximum redundant bit width allocatable to each multiplicand, calculated as in formula (2); max_w1 is the maximum bit width among the high, middle, and low partial products, calculated as in formula (4), where mp_w is the bit width of the middle partial product, calculated as in formula (5).
num_acc = floor((2^(q + max_w1) − 1) / (2^max_w1 − 1)) (3)
max_w1=max(n1+n3,n2+n4,mp_w) (4)
mp_w=max(n1+n4+x+n4+n2,n2+n3+y+n2+n4)+1-min(x+n2+n4,y+n2+n4) (5)
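The quantities above can be scripted. The q and num_acc expressions below are inferred reconstructions checked against the worked 2-bit embodiment given later (q = 9, num_acc = 528); mp_w and max_w1 follow formulas (4) and (5) as printed:

```python
def redundancy_params(n1, n2, n3, n4, x, y, m=25, d=2):
    """Redundant bits per multiplicand and accumulation limit, parallelism 2."""
    # eq. (2): spliced operand widths plus d guard fields must fit in m bits
    mult1 = n1 + x + n4 + n2
    mult2 = n3 + y + n2 + n4
    q = (m - max(mult1, mult2)) // d
    # eq. (5): middle partial product bit width
    mp_w = max(n1 + n4 + x + n4 + n2, n2 + n3 + y + n2 + n4) + 1 \
        - min(x + n2 + n4, y + n2 + n4)
    # eq. (4): widest of the high, middle, and low partial products
    max_w1 = max(n1 + n3, n2 + n4, mp_w)
    # eq. (3): accumulations of max_w1-bit values before q guard bits overflow
    num_acc = (2 ** (q + max_w1) - 1) // (2 ** max_w1 - 1)
    return q, mp_w, max_w1, num_acc
```

With n1 = n2 = n3 = n4 = 2 and x = y = 1 this reproduces the values quoted in the embodiment: q = 9, mp_w = 5, max_w1 = 5, num_acc = 528.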
Consider next the scenario where one DSP performs more than two multiply-accumulate operations simultaneously, as shown in fig. 3. For the three groups of low bit width multiplications to be performed, the multiplicand bit widths are n1′, n2′, n3′ and the multiplier bit widths are n4′, n5′, n6′. The basic idea is to merge the three groups into two groups and then apply the two-group parallel scheme step by step. As shown in fig. 3(a), first substitute n1 = n1′ + n2′, n3 = n4′ + n5′, n2 = n3′, and n4 = n6′ into formula (1) to obtain x1 and y1; in the second step, substitute n1 = n1′, n3 = n4′, n2 = n2′ + x1 + n6′ + n3′, and n4 = n5′ + y1 + n6′ + n3′ into formula (1) to obtain x2 and y2, as shown in fig. 3(b).
The spliced multiplicand now comprises five parts, and redundant bit widths must be reserved in front of the three actual multiplicands to prevent overflow during accumulation. The redundant bit width and the maximum number of accumulations are calculated by analogy with the parallelism-2 scheme.
Let the number of bits of the inserted redundant bit width be q; then q is calculated according to formula (6), where m is the highest bit width of the DSP multiplier; d is the parallelism, i.e., how many groups of multiply-accumulate one DSP performs simultaneously; and mult1 and mult2 are the bit widths of the two operands after shift splicing, namely
mult1 = n1′ + n2′ + n3′ + x1 + x2 + n3′ + n5′ + n6′ + y1 + n6′
mult2 = n4′ + n5′ + n6′ + y1 + y2 + n2′ + n3′ + n6′ + x1 + n3′
q = floor((m − max(mult1, mult2)) / d) (6)
The number of accumulations supportable by the redundant bit width is num_acc, calculated as in formula (7), where q is the maximum redundant bit width allocatable to each multiplicand, calculated as in formula (6); max_w1 is the maximum bit width among the high, middle, and low partial products, calculated as in formula (8), where mp_w is the bit width of the middle partial product. The parallelism is now 3, so mp_w comprises two partial calculations. First, calculate the partial product of the gray part in fig. 4: the green part forms two groups of multiplications, analyzed at parallelism 2. Substituting n1 = n2′, n2 = n3′, n3 = n5′, n4 = n6′, x = x1, and y = y1 into formula (5) gives mp_w1. Then analyze the partial products after the high bits are added, i.e., treat the green part in fig. 4 as one group of multiplication and n1′ × n4′ as another group, again analyzed at parallelism 2. Substituting n1 = n1′, n2 = n2′ + x1 + n6′ + n3′, n3 = n4′, n4 = n5′ + y1 + n3′ + n6′, x = x2, and y = y2 into formula (5) gives mp_w2. Then mp_w = max(mp_w1, mp_w2), and substituting into formula (8) yields the maximum partial product bit width max_w1 for parallelism 3; finally, substituting max_w1 into formula (7) gives the maximum number of accumulations.
num_acc = floor((2^(q + max_w1) − 1) / (2^max_w1 − 1)) (7)
max_w1=max(n1′+n4′,n2′+n5′,n3′+n6′,mp_w) (8)
For higher parallelism, the data are first merged pairwise following fig. 3 and analyzed as the parallelism-2 case to obtain the shift bit widths for splicing each operand; then, by analogy with the redundant bit width and maximum accumulation analysis at parallelism 3, the operands are merged and reduced step by step to parallelism 2, from which the number of redundant bits and the maximum number of accumulations under the corresponding high parallelism are calculated.
Next, consider the case where one multiplier is shared across the groups of multiplications, corresponding to weight sharing in actual network computation. For two parallel groups of multiplications, the zero-insertion method is shown in fig. 5: two numbers of bit widths n1 and n2 are multiplied by the same multiplier of bit width n3, and only n3 zeros need to be inserted before n2 for isolation. For three parallel groups, the zero-insertion method is shown in fig. 6: the fixed multiplier has bit width n4, and n4 zeros must be inserted between each pair of the three multiplicands. Likewise, for more parallel groups, inserting a multiplier's width of zeros between every two multiplicands achieves the isolation. Finally, a redundant bit width is inserted before each multiplicand to prevent accumulation overflow, giving the final data format shown in fig. 7, where a redundant bit width of 2 is taken as an example.
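A sketch of this fixed-multiplier packing (the function and field layout are illustrative; guard bits are here folded into the isolation gap): each multiplicand is followed by n zero bits, one wide multiply is issued, and each product is sliced from its own field:

```python
def pack_fixed_multiplier(multiplicands, widths, w, n):
    """Multiply several multiplicands by one fixed n-bit multiplier w at once.

    Each multiplicand occupies a field of (its width + n) bits, the n extra
    bits being the zero isolation between neighbours, so each product
    a_i * w (at most width_i + n bits) stays inside its own field.
    """
    A, pos, fields = 0, 0, []
    for a, wi in zip(reversed(multiplicands), reversed(widths)):
        A += a << pos                    # place multiplicand, low to high
        fields.append((pos, wi + n))     # remember where its product lands
        pos += wi + n                    # advance past data + isolation zeros
    P = A * w                            # the single wide multiplication
    products = [(P >> p) & ((1 << width) - 1) for p, width in fields]
    return list(reversed(products))
```

For example, three 2-bit multiplicands against a 2-bit fixed multiplier model the parallelism-3 weight-sharing case of fig. 6.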
In the case of a fixed multiplier, let the parallelism be d and the bit width of the fixed multiplier be n; the maximum number of accumulations num_acc is calculated as in formula (9), where max_w2 represents the sum of the maximum multiplicand bit width and the multiplier bit width; q is the maximum redundant bit width allocatable to each multiplicand, calculated as in formula (10), where ni indicates the bit width of the i-th multiplicand.
num_acc = floor((2^(q + max_w2) − 1) / (2^max_w2 − 1)) (9)
q = floor((m − ((n1 + n2 + ... + nd) + (d − 1) × n)) / d) (10)
When the number of accumulations reaches the limit imposed by the redundant bit width, additional DSPs must be introduced to continue accumulating. The number of additional DSPs is d − 1 (where d is the parallelism). From this, the acceleration ratio r of the DSP can be calculated, as in equation (11).
r = d × num_acc / (num_acc + d − 1) (11)
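Equation (11) can be read as d × num_acc useful low bit width multiplies amortized over num_acc packed DSP cycles plus the d − 1 extra DSP operations, an interpretation that reproduces every ratio quoted in the embodiments:

```python
def acceleration_ratio(d, num_acc):
    # eq. (11): d * num_acc low bit width multiplies are completed in
    # num_acc packed cycles, plus d - 1 extra DSP operations at the limit
    return d * num_acc / (num_acc + d - 1)
```

For the non-fixed-multiplier 2-bit case (d = 2, num_acc = 528) this gives 1.996, and for the 1-bit parallelism-3 case (d = 3, num_acc = 32) it gives 2.824, matching the values below.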
The invention implements the neural network accelerator on a Xilinx 7-series FPGA chip. The computing array of the accelerator mainly comprises 8 × 9 PEs (processing elements); the computing function of each PE is realized by the DSP48E1 of the Xilinx 7-series chip, which provides a 25 × 18 multiplier and a 48-bit accumulator.
The PE allocation in the neural network computation is illustrated in fig. 8, which shows the implementation of one convolution layer: the feature map has 80 channels of size 13 × 13, and there are 64 convolution kernels, each with 80 channels of size 3 × 3. Taking the weight-sharing case as an example, each row of PEs computes one channel of the convolution kernels; each PE receives one element of the convolution kernel and, according to the parallelism d set for the DSP, d values from the feature map. Fig. 8 shows the case of parallelism 2: each PE receives two feature-map values at the same time, which is equivalent to performing two convolutions simultaneously. The data flow inside the PE is shown in fig. 9. The shared weight is fed into port ① and passed to DSP port B through multiplexer sel1; one of the two feature-map values is fed into port ② and the other into port ③; the value from port ② is shifted and added to the value from port ③ to form the spliced result, which is passed to DSP port A through multiplexer sel2, after which the DSP performs the multiplication. The internal structure of the DSP is shown in fig. 10; its four input ports B, A, D, and C are 18, 30, 25, and 48 bits wide, respectively. For higher parallelism, the result of the first splicing is fed back to one input of the adder through selector sel4, a new value from port ③ is shifted into the other input, and the addition yields the second-level splicing result. For the non-fixed-multiplier case, the multipliers at ports ①, ②, and ③ of fig. 9 are spliced in the same way, and the two splicing results are fed to DSP port B and port A through sel1 and sel2, respectively.
Based on the above parallel computing method, for the non-fixed-multiplier case, take multiplicand and multiplier bit widths of 2 as an example. When the parallelism is 2, substituting n1 = n2 = n3 = n4 = 2 into formula (1) and taking the minimum x and y satisfying the constraint gives x = y = 1; substituting into equations (2) to (5) then gives q = 9, mp_w = 5, max_w1 = 5, and num_acc = 528. Substituting d = 2 and num_acc = 528 into equation (11) gives an acceleration ratio of 1.996. When the parallelism is 3, following the procedure of fig. 3(a), substituting n1 = n3 = 4 and n2 = n4 = 2 into formula (1) and taking the minimum x and y satisfying the constraint gives x1 = y1 = 3; proceeding to the step of fig. 3(b), n4′ + n2′ + n3′ + n6′ + x1 + y1 + n5′ + n3′ + n6′ = 20 > 18, which exceeds the multiplier bit width of the DSP, so the DSP48E1 cannot support 2-bit multiplication with parallelism 3.
Taking multiplicand and multiplier bit widths of 1 as an example, when the parallelism is 3, following the step of fig. 3(a), substituting n1 = n3 = 2 and n2 = n4 = 1 into formula (1) and taking the minimum values of x and y satisfying the condition gives x1 = y1 = 1. Proceeding to the step of fig. 3(b), substituting n1 = n3 = 1 and n2 = n4 = 4 into formula (1) and taking the minimum values of x and y satisfying the condition gives x2 = y2 = 3. However, the redundant bit widths n2' + n3' + n6' + x1 and n5' + n3' + n6' + y1 already inserted isolate the low-order product and prevent it from affecting the high-order product, so x2 = y2 = 0 can be used instead (that is, when the x and y calculated from equation (1) are negative, both may be set to 0). Substituting into equations (6) to (8) then gives q = 5, mp_w = 13, mp_w2 = 6, max_w1 = 6 and num_acc = 32. Substituting d = 3 and num_acc = 32 into equation (11) yields an acceleration ratio of 2.824.
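The search for the smallest shift amounts x and y can be sketched with a brute-force check of constraint (1), taken exactly as written above: x + y + 2(n2 + n4) ≥ max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1. The function name `min_xy` and the search bound are illustrative, not from the patent.

```python
def min_xy(n1: int, n2: int, n3: int, n4: int, search: int = 16):
    """Smallest (x, y) by total shift x + y that satisfies constraint (1)."""
    best = None
    for x in range(search):
        for y in range(search):
            lhs = x + y + 2 * (n2 + n4)
            rhs = max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1
            if lhs >= rhs and (best is None or x + y < best[0] + best[1]):
                best = (x, y)
    return best

# Reproduces the 2-bit, parallelism-2 worked example: x = y = 1.
assert min_xy(2, 2, 2, 2) == (1, 1)
```

For n1 = n2 = n3 = n4 = 2 the constraint reduces to min(x, y) ≥ 1, so (1, 1) is the unique minimum, matching the text.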
For the case of a fixed multiplier: when both multiplier and multiplicand are 4 bits, the supported parallelisms are 2 and 3, with acceleration ratios of 1.969 and 1.5, respectively; when both are 3 bits, the supported parallelisms are 2, 3 and 4, with acceleration ratios of 1.992, 2.4 and 1.6; when both are 2 bits, the supported parallelisms are 2, 3, 4 and 5, with acceleration ratios of 1.996, 2.833, 2.286 and 1.667; when both are 1 bit, the supported parallelisms are 2 through 8, with acceleration ratios of 1.999, 2.931, 3.5, 3.571, 3, 1.75 and 1.778, respectively.
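The fixed-multiplier case generalizes the splicing to parallelism d: following the rule of claim 8, each multiplicand is followed by as many zero bits as the multiplier bit width before the whole packed word is multiplied once. The sketch below is illustrative for unsigned operands; the function names are assumptions, and the redundant guard bits used for accumulation are again omitted.

```python
def pack_fixed(vals: list[int], n_bits: int, mult_bits: int) -> int:
    """Splice d multiplicands, most significant first, leaving mult_bits zeros after each."""
    step = n_bits + mult_bits
    packed = 0
    for v in vals:
        packed = (packed << step) + v
    return packed

def fixed_multiplier_products(vals: list[int], w: int, n_bits: int, mult_bits: int) -> list[int]:
    """Multiply every value in vals by the fixed multiplier w with one wide multiply."""
    step = n_bits + mult_bits
    product = pack_fixed(vals, n_bits, mult_bits) * w
    # Slice the wide product back into the d individual products, high field first.
    return [
        (product >> (step * (len(vals) - 1 - i))) & ((1 << step) - 1)
        for i in range(len(vals))
    ]

# Parallelism 3 with 4-bit multiplicands and a fixed 4-bit multiplier:
assert fixed_multiplier_products([5, 3, 6], 7, n_bits=4, mult_bits=4) == [35, 21, 42]
```

Because the multiplier side needs no splicing, the packed multiplicand is the only wide operand, which is why the fixed-multiplier case supports higher parallelism than the non-fixed case at the same bit width.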
The present invention is compared to prior art methods, as shown in table 1.
TABLE 1 comparison of the present invention with existing methods
| Method | Parallelism | Multiplier bit width | Multiplier restriction |
| --- | --- | --- | --- |
| Xilinx int8 | 2 | 8 | Fixed multiplier |
| The invention | Arbitrary | Arbitrary | Arbitrary or fixed multiplier |

Claims (10)

1. A mixed bit width accelerator based on DSP is characterized by comprising a first signal input end, a second signal input end, a third signal input end, a fourth signal input end, a first shifting unit, a first adder, a first selector, a third selector, a second shifting unit, a second adder, a second selector, a fourth selector and a DSP;
the first signal input end and the second signal input end are respectively used for receiving a multiplicand, and the third signal input end and the fourth signal input end are respectively used for receiving a multiplier;
when the parallelism is 1, a multiplicand at the first signal input end directly enters the DSP through the first selector, a multiplier at the third signal input end directly enters the DSP through the second selector, and the multiplicand and the multiplier are subjected to multiplication accumulation operation in the DSP;
when the parallelism is more than 1, the multiplicand of the first signal input end is shifted in the first shifting unit, the multiplicand enters the first adder and is added with the multiplicand of the second signal input end after the shifting operation, the multiplicand obtained after the adding operation enters the third selector, the multiplicand enters the first adder through the third selector, and the multiplicand continues to be added with the multiplicand of the first signal input end after the shifting operation until all the multiplicands finish the adding operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the multiplier of the third signal input end is shifted in the second shifting unit, the shifted multiplier enters the second adder and is added with the multiplier of the fourth signal input end, the multiplier obtained after the addition enters the fourth selector, the multiplier enters the second adder through the selection of the fourth selector, and the multiplier continues to be added with the multiplier of the third signal input end after the shifting operation until all the multipliers finish the addition operation; after all multipliers finish addition operation, the high-bit-width multiplier obtained by the second adder enters the DSP through the selection of the second selector; and the high bit width multiplicand and the high bit width multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
2. The DSP-based hybrid bit width accelerator of claim 1, wherein when the parallelism is greater than 1, the shift number calculation method when performing the shift operation is:
setting the multiplicand bit width of the first signal input end as n1, the multiplicand bit width entering the first adder through the third selector as n2, the multiplier bit width of the third signal input end as n3, and the multiplier bit width entering the second adder through the fourth selector as n4; the numbers of shifts for n1 and n3 are x + n4 and y + n2, respectively, where x, y, n1, n2, n3 and n4 satisfy the following constraint:
x + y + 2(n2 + n4) ≥ max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1 (1).
3. the DSP based hybrid bit-width accelerator of claim 1, wherein a redundant bit-width is inserted before each multiplicand in the high bit-width multiplicand resulting from the first adder.
4. The DSP-based hybrid bit-width accelerator of claim 1, wherein the redundant bit-width bit number calculation method is:
if the number of bits of the redundant bit width inserted before each multiplicand is q, the calculation formula of q is as shown in formula (6)
Figure FDA0003433409600000021
Wherein m is the highest bit width of the DSP, d is the parallelism, and mult1 and mult2 are the bit widths of the high bit width multiplicand obtained by the first adder and the high bit width multiplier obtained by the second adder, respectively.
5. The DSP-based hybrid bit-width accelerator of claim 4, wherein when the parallelism is 2, the multiplicand bit width at the first signal input end is n1, the multiplicand bit width entering the first adder via the third selector is n2, the multiplier bit width at the third signal input end is n3, and the multiplier bit width entering the second adder via the fourth selector is n4;
let num_acc be the maximum number of accumulations supportable by the redundant bit width, calculated according to formula (3); let max_w1 be the maximum bit width among the high-order partial product, the middle partial product and the low-order partial product, calculated according to formula (4); and let mp_w be the bit width of the middle partial product, calculated according to formula (5);
[Formula (3) appears as an image in the original document.]
max_w1 = max(n1 + n3, n2 + n4, mp_w) (4)
mp_w = max(n1 + n4 + x + n4 + n2, n2 + n3 + y + n2 + n4) + 1 - min(x + n2 + n4, y + n2 + n4) (5)
wherein x, y, n1, n2, n3, and n4 satisfy the following constraints:
x + y + 2(n2 + n4) ≥ max(n1 + x + n4 + n2 + n4, n3 + y + n2 + n4 + n2) + 1 (1).
6. the DSP-based hybrid bit width accelerator of claim 5, wherein when the parallelism is greater than 2, the plurality of groups of multiplications are first combined into two groups of multiplications, and then the maximum number of accumulations is calculated using a method that handles parallelism of 2.
7. The DSP-based hybrid bit width accelerator of claim 1, wherein when the parallelism is greater than 1 and is a fixed multiplier, the multiplicand at the first signal input end performs a shift operation in the first shift unit, the multiplicand after the shift operation enters the first adder and performs an addition operation with the multiplicand at the second signal input end, the multiplicand obtained after the addition operation enters the third selector, and is selected by the third selector to enter the first adder, and the addition operation with the multiplicand at the first signal input end after the shift operation is continued until all the multiplicands complete the addition operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the fixed multiplier of the third signal input end directly enters the DSP through the second selector; the high bit width multiplicand and the fixed multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
8. The DSP-based hybrid bit-width accelerator according to claim 7, wherein the shift operation is specifically: a number of 0 bits equal to the multiplier bit width is inserted after the multiplicand.
9. The DSP-based hybrid bit-width accelerator of claim 7, wherein a redundant bit width is inserted before each multiplicand in the high-bit-width multiplicand obtained by the first adder; the parallelism is d and the bit width of the fixed multiplier is n; the maximum number of accumulations num_acc supportable by the redundant bit width is calculated according to formula (9), where max_w2 denotes the sum of the maximum multiplicand bit width and the multiplier bit width; q is the number of bits of redundant bit width inserted before each multiplicand, calculated according to formula (10), where n_i denotes the bit width of the i-th multiplicand;
[Formula (9) appears as an image in the original document.]
[Formula (10) appears as an image in the original document.]
10. A DSP-based hybrid bit width fusion calculation method, based on the accelerator of claim 1, characterized in that:
when the parallelism is 1, a multiplicand at the first signal input end directly enters the DSP through the first selector, a multiplier at the third signal input end directly enters the DSP through the second selector, and the multiplicand and the multiplier are subjected to multiplication accumulation operation in the DSP;
when the parallelism is more than 1, the multiplicand of the first signal input end is shifted in the first shifting unit, the multiplicand enters the first adder and is added with the multiplicand of the second signal input end after the shifting operation, the multiplicand obtained after the adding operation enters the third selector, the multiplicand enters the first adder through the third selector, and the multiplicand continues to be added with the multiplicand of the first signal input end after the shifting operation until all the multiplicands finish the adding operation; after all multiplicands complete addition operation, the high-bit-width multiplicand obtained by the first adder enters the DSP through the selection of the first selector; the multiplier of the third signal input end is shifted in the second shifting unit, the shifted multiplier enters the second adder and is added with the multiplier of the fourth signal input end, the multiplier obtained after the addition enters the fourth selector, the multiplier enters the second adder through the selection of the fourth selector, and the multiplier continues to be added with the multiplier of the third signal input end after the shifting operation until all the multipliers finish the addition operation; after all multipliers finish addition operation, the high-bit-width multiplier obtained by the second adder enters the DSP through the selection of the second selector; and the high bit width multiplicand and the high bit width multiplier entering the DSP carry out multiplication accumulation operation in the DSP.
CN202111605030.1A 2021-12-24 2021-12-24 Mixed bit width accelerator based on DSP and fusion calculation method Active CN114239819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111605030.1A CN114239819B (en) 2021-12-24 2021-12-24 Mixed bit width accelerator based on DSP and fusion calculation method

Publications (2)

Publication Number Publication Date
CN114239819A true CN114239819A (en) 2022-03-25
CN114239819B CN114239819B (en) 2023-09-26

Family

ID=80762958


Country Status (1)

Country Link
CN (1) CN114239819B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000099313A (en) * 1998-09-25 2000-04-07 Denso Corp Multiplier
CN102591615A (en) * 2012-01-16 2012-07-18 中国人民解放军国防科学技术大学 Structured mixed bit-width multiplying method and structured mixed bit-width multiplying device
CN111522528A (en) * 2020-04-22 2020-08-11 厦门星宸科技有限公司 Multiplier, multiplication method, operation chip, electronic device, and storage medium

Non-Patent Citations (2)

Title
FAN Di; WANG Jian; LAI Jinmei: "A DSP block suitable for low bit-width multiply-accumulate in FPGA", Journal of Fudan University (Natural Science), no. 05 *
WANG Nan; HUANG Zhihong; YANG Haigang; DING Jian: "An FPGA embedded DSP IP design supporting efficient addition", Journal of Terahertz Science and Electronic Information, no. 05 *


Similar Documents

Publication Publication Date Title
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN106909970B (en) Approximate calculation-based binary weight convolution neural network hardware accelerator calculation device
CN110119809B (en) Apparatus and method for performing MAC operations on asymmetrically quantized data in neural networks
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
US20180197084A1 (en) Convolutional neural network system having binary parameter and operation method thereof
Draisma et al. A bootstrap-based method to achieve optimality in estimating the extreme-value index
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN110705703B (en) Sparse neural network processor based on systolic array
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
US10776078B1 (en) Multimodal multiplier systems and methods
CN112434801B (en) Convolution operation acceleration method for carrying out weight splitting according to bit precision
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
US12039430B2 (en) Electronic device and method for inference binary and ternary neural networks
CN108897716A (en) By memory read/write operation come the data processing equipment and method of Reduction Computation amount
CN113052299B (en) Neural network memory computing device based on lower communication bound and acceleration method
CN110766136B (en) Compression method of sparse matrix and vector
CN114239819B (en) Mixed bit width accelerator based on DSP and fusion calculation method
CN111626410B (en) Sparse convolutional neural network accelerator and calculation method
CN111966327A (en) Mixed precision space-time multiplexing multiplier based on NAS (network attached storage) search and control method thereof
CN108090865B (en) Optical satellite remote sensing image on-orbit real-time streaming processing method and system
WO2023078364A1 (en) Operation method and apparatus for matrix multiplication
CN114021070A (en) Deep convolution calculation method and system based on micro-architecture processor
WO2019205064A1 (en) Neural network acceleration apparatus and method
US20170068518A1 (en) Apparatus and method for controlling operation
JP7238376B2 (en) Information processing system and information processing system control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant