CN110647309A - High-speed big bit width multiplier

Info

Publication number: CN110647309A
Application number: CN201910934899.7A
Authority: CN
Inventors: 吴冰瑞; 俞艳东; 张培勇; 陆玲霞
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-09-29
Filing date: 2019-09-29
Publication date: 2020-01-03
Anticipated expiration: 2039-09-29
Also published as: CN110647309B

Abstract

The invention provides a high-speed large-bit-width multiplier. The multiplier comprises two complementary clocks, a CLA adder, an overflow processing module, a decoder, a K-bit multiplication unit and a data operation module. Its operation method comprises the following steps: dividing the partial products into two groups, each controlled by a different clock, and computing the two groups in parallel; and performing the multiplication and shift-add operations on the rising edges of the two complementary clocks to obtain the final product. The multiplier halves the number of clock cycles consumed and thereby increases the operation speed. It can be used in integrated circuits, programmable logic devices, digital signal processing, communication and similar fields, and is characterized by a simple circuit structure, low resource usage, high speed and the ability to multiply operands of large bit width.

Description

High-speed big bit width multiplier

Technical Field

The invention belongs to the field of computers and integrated circuits, and particularly relates to a design of a high-speed large-bit-width multiplier which can be applied to the fields of digital image processing, communication and the like.

Background

With the rapid development of artificial intelligence, cloud computing and Internet of Things technologies, processors face ever higher performance requirements. The multiplier is a core component of the processor: because of its long delay it largely determines the operating frequency of the whole system, and a large-bit-width multiplier also consumes considerable chip area. The speed and area of the multiplier therefore determine the performance and cost of the whole processor and are a key target of system optimization.

Several high-speed multipliers have been proposed in recent years. They fall into four main types: the adder-tree multiplier, the parallel multiplier, the look-up-table multiplier and the shift-and-add multiplier. The adder-tree multiplier and the parallel multiplier are fast, but the hardware resources they consume grow rapidly as the number of operand bits increases. The look-up-table multiplier uses the operands as addresses to access a memory that stores the multiplication results; its speed depends on the memory access speed, and the memory size grows sharply as the number of operand bits increases. The shift-and-add multiplier consumes few resources but is slow.

Multiplication has two main aspects: the generation of partial products and the accumulation of those partial products. Speeding up a multiplier therefore mainly means reducing the number of partial products and accelerating their accumulation. Many multiplier implementations have been proposed, and most of them concentrate on raising speed, but they neglect the joint optimization of speed and resource consumption for multipliers of large bit width.

Disclosure of Invention

The invention aims to provide a high-speed large-bit-width multiplier which is characterized by simple circuit structure, less occupied resources and high speed, and can realize large-bit-width operand multiplication operation. The multiplier can be used in the fields of integrated circuits, programmable logic devices, digital signal processing, communication and the like. The multiplier of the invention can reduce the consumption of clock period by half and improve the operation speed of the multiplier.

In order to achieve the purpose, the invention adopts the technical scheme that:

a high-speed big-bit-width multiplier comprises two complementary clocks, a CLA adder, an overflow processing module, a decoder, a K-bit multiplying unit and a data operation module;

the two complementary clocks have the same frequency and opposite phases and are used for controlling the two counters;

The CLA adder, i.e. the carry-lookahead adder, generates the carry signals of all stages simultaneously, which greatly reduces the time needed to produce the carries. Here the adder adds the segments of the divided operand $X = \{a_{m-1} a_{m-2} \dots a_1 a_0\}$ pairwise, e.g. $a_0+a_1$, $a_0+a_2$, $a_0+a_3$, $a_1+a_2$, ..., $a_{m-2}+a_{m-1}$. Likewise, for the operand $Y = \{b_{m-1} b_{m-2} \dots b_1 b_0\}$ it computes $b_0+b_1$, $b_0+b_2$, $b_0+b_3$, $b_1+b_2$, ..., $b_{m-2}+b_{m-1}$.
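The simultaneous carry generation described above can be illustrated with a minimal bit-level sketch in Python. The function name `cla_add` and the parameter `width` are illustrative assumptions, not names from the patent; the point is only that every carry is written as a function of the generate and propagate signals and the carry-in, never of another carry, which is what lets hardware evaluate them in parallel (the Python loops are of course sequential).

```python
def cla_add(a: int, b: int, width: int, carry_in: int = 0) -> int:
    """Add two width-bit integers with carry-lookahead logic (bit-level model)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate: both bits set
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate: exactly one bit set

    # carries[i] is the carry into bit i. Each one depends only on g, p and
    # carry_in, so no carry has to wait for the previous carry to settle.
    carries = [carry_in]
    for i in range(1, width + 1):
        c = 0
        for j in range(i):
            # g[j] reaches position i if every stage j+1 .. i-1 propagates it
            if g[j] and all(p[j + 1:i]):
                c = 1
        if carry_in and all(p[:i]):
            c = 1
        carries.append(c)

    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total | (carries[width] << width)  # the result can be width + 1 bits wide
```

Note that the sum of two K-bit segments can occupy K + 1 bits, which is the case the overflow processing module exists to handle.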

The overflow processing module mainly judges the result of the first-stage CLA adder, and when the bit width of the result is greater than K bits, the output result of the module is sent to a decoder and a data operation module at the lower stage under the control of a clock.

The decoder mainly decodes the code of the counter, selects the data sent from the front stage, stores the data in a corresponding register, and then sends the data to the K bit multiplication unit for operation.

The K-bit multiplication unit is a partial-product generator that takes two K-bit operands and produces a 2K-bit result; it may be of the pipelined type, the Booth type or a similar type.

The data operation module is mainly used for carrying out shift addition on the operation result of the K bit multiplication unit under the control of a clock and a counter, the output result of the data operation module enters the next-stage adder for final calculation, and the second-stage CLA adder outputs a correct operation result under the control of the counter.

Further, the method for sending the module output result to the lower-level decoder and the data operation module by the overflow processing module is as follows: the low K bits of the result of the CLA adder are sent to a decoder, and the highest bit of the result is sent to a data operation module, so that the multiplication result is correct.

Further, the operation method of the multiplier is as follows: the Karatsuba algorithm is applied to the hardware of a parallel multiplier, the partial products are divided into two groups, and the two clocks drive their respective counters to produce different codes that control the decoder; the codes may be one-hot codes or Gray codes. On each rising clock edge, the corresponding operands are fed in turn into the K-bit multiplication unit, and the multiplication and shift-add operations are carried out to obtain the final product.
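The two counter encodings mentioned above can be sketched as follows. This is only an illustration of one-hot and binary-reflected Gray coding in general; the function names are assumptions, and the patent does not state the counter width or which states the decoder uses.

```python
def gray_code(n: int) -> int:
    """Binary-reflected Gray code of counter value n."""
    return n ^ (n >> 1)

def one_hot(n: int) -> int:
    """One-hot code: only bit n is set."""
    return 1 << n

def decode_gray(code: int) -> int:
    """Recover the counter value from its Gray code."""
    n = 0
    while code:
        n ^= code
        code >>= 1
    return n

# Print the first few counter states in both encodings.
for count in range(5):
    print(count, format(gray_code(count), "03b"), format(one_hot(count), "05b"))
```

With a Gray code, consecutive counter states differ in exactly one bit; with a one-hot code, the decoder only has to test a single bit per state.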

Further, applying the Karatsuba algorithm to the hardware of a parallel multiplier reduces the number of partial products in large-bit-width multiplication.

The Karatsuba algorithm is a fast, recursive multiplication algorithm used mainly for multiplying two large numbers; it has lower complexity than ordinary multiplication and greatly improves efficiency. Its basic idea is to split each of the two large numbers x and y into several numbers with fewer bits. After this split, the multiplication reduces to three multiplications plus a small number of additions and shifts.
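As a minimal sketch of the three-multiplication idea, the following Python function performs one level of the split (no recursion). The name `karatsuba_2way` and the parameter `k` (the width of each half in bits) are illustrative assumptions.

```python
def karatsuba_2way(x: int, y: int, k: int) -> int:
    """Multiply two 2k-bit numbers with three multiplications instead of four."""
    mask = (1 << k) - 1
    a0, a1 = x & mask, x >> k   # x = a1 * 2^k + a0
    b0, b1 = y & mask, y >> k   # y = b1 * 2^k + b0

    low = a0 * b0                      # multiplication 1
    high = a1 * b1                     # multiplication 2
    cross = (a0 + a1) * (b0 + b1)      # multiplication 3 (operands may be k + 1 bits)
    middle = cross - low - high        # equals a1*b0 + a0*b1, no extra multiplication

    # Recombine with shifts and additions only.
    return low + (middle << k) + (high << (2 * k))
```

The sums `a0 + a1` and `b0 + b1` may be one bit wider than k, which is why a hardware version needs some form of overflow handling in front of a fixed-width multiplication unit.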

Compared with the faster Toom-Cook algorithm, the Karatsuba algorithm is simpler to implement in hardware. Weighing area cost and hardware realizability together, the Karatsuba algorithm is better suited to multipliers in processors.

The specific implementation is as follows. The two n-bit inputs x and y are each divided into m k-bit numbers, where m > 0 and k > 0. Let $X = \{a_{m-1} a_{m-2} \dots a_1 a_0\}$ and $Y = \{b_{m-1} b_{m-2} \dots b_1 b_0\}$. Then:

$$X \cdot Y = a_0 b_0 + (a_1 b_0 + a_0 b_1)\,2^{k} + \dots + 2^{2(m-1)k}\,a_{m-1} b_{m-1} \tag{1}$$

where, according to the Karatsuba algorithm:

$$a_1 b_0 + a_0 b_1 = (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1 \tag{2}$$

Replacing the coefficient in equation (1) with the three terms on the right-hand side of equation (2) yields:

$$X \cdot Y = a_0 b_0 + \left[(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1\right] 2^{k} + \dots + 2^{2(m-1)k}\,a_{m-1} b_{m-1} \tag{3}$$

The number of partial products is thereby reduced from $m^2$ to $\frac{m(m+1)}{2}$.
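The reduced count can be checked with a short tally, given here as a sketch under the assumption that every cross term $a_i b_j + a_j b_i$ with $i < j$ is replaced by a single product of sums, as in equation (2):

```latex
\underbrace{m}_{\text{products } a_i b_i}
\;+\;
\underbrace{\binom{m}{2}}_{\text{products } (a_i + a_j)(b_i + b_j),\ i < j}
\;=\; m + \frac{m(m-1)}{2}
\;=\; \frac{m(m+1)}{2}
```

For m = 4 this gives 10 products instead of 16.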

In the invention, because the coefficient terms of formula (3) taken from the first term onward and those taken from the last term backward do not affect one another, the coefficient multiplications can be carried out from the head and from the tail simultaneously.

The invention adopts a pair of complementary clocks with the same frequency and opposite phases, so that the data change of the counters controlled by the two clocks can be separated by half a period, and no interference is generated when the data are input to a decoder. The decoder selects proper preceding stage data to be sent to the K bit multiplication unit in real time according to the two counters, and the data sent to the K bit multiplication unit every time are spaced by half a clock period. If the first-stage CLA adder has data overflow, a data overflow processing module is called. The data operation module sends the processed data to the second-stage CLA adder, and the second-stage CLA adder outputs a correct operation result under the control of the counter.
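The half-period spacing can be illustrated with an idealized timing sketch. The 10 ns period, the function name `rising_edges` and the event labels are assumptions made for the illustration; skew, setup and hold times are ignored.

```python
def rising_edges(period: float, phase: float, count: int) -> list[float]:
    """Times of the first `count` rising edges of an ideal clock."""
    return [phase + n * period for n in range(count)]

PERIOD = 10.0  # ns, an arbitrary illustrative value
clk1 = rising_edges(PERIOD, phase=0.0, count=4)
clk2 = rising_edges(PERIOD, phase=PERIOD / 2, count=4)  # same frequency, opposite phase

# Merge both edge streams in time order, tagging each edge with the counter it updates.
events = sorted([(t, "counter 1") for t in clk1] + [(t, "counter 2") for t in clk2])

# Spacing between consecutive counter updates as seen by the decoder.
gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
```

In this model the two counters never change at the same instant: their updates alternate, and every gap equals half a period.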

The invention has the beneficial effects that:

Compared with the prior art, the invention uses the Karatsuba algorithm to reduce the number of partial products, which increases multiplication speed, reduces resource consumption and critical-path delay, and lowers chip cost. In addition, the complementary-clock optimization halves the number of clock cycles required for an operation and improves operating efficiency.

Compared with the traditional parallel multiplier, the invention has the advantages that the operation speed is improved by 3-4 times; compared with other types of multipliers, the speed is higher, the circuit is simpler, and the occupied resources are less.

Description of the drawings:

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is a diagram of the implementation architecture of the present invention.

Detailed description of the embodiments:

the invention is further described with reference to the accompanying drawings and specific examples.

Fig. 2 shows a high-speed large-bit-width multiplier according to the present invention, which includes two complementary clocks, a CLA adder, an overflow processing module, a decoder, a K-bit multiplying unit, and a data operation module;

the CLA adder, i.e. the carry-lookahead adder, generates the carry signals of all stages simultaneously, which greatly reduces the time needed to produce the carries. Here the adder adds the segments of the divided operand $X = \{a_{m-1} a_{m-2} \dots a_1 a_0\}$ pairwise, e.g. $a_0+a_1$, $a_0+a_2$, $a_0+a_3$, $a_1+a_2$, ..., $a_{m-2}+a_{m-1}$; likewise, for the operand $Y = \{b_{m-1} b_{m-2} \dots b_1 b_0\}$ it computes $b_0+b_1$, $b_0+b_2$, $b_0+b_3$, $b_1+b_2$, ..., $b_{m-2}+b_{m-1}$;

The overflow processing module mainly judges the result of the first-stage CLA adder, and when the bit width of the result is more than K bits, the output result of the module is sent to a decoder and a data operation module at the lower stage under the control of a clock;

the decoder mainly decodes the code of the counter, selects the data sent from the front stage, stores the data in a corresponding register and sends the data to the K bit multiplication unit for operation;

the K-bit multiplication unit is a partial-product generator that takes two K-bit operands and produces a 2K-bit result; it may be of the pipelined type, the Booth type or a similar type;

The high-speed large-bit-width multiplier applies the Karatsuba algorithm to the hardware implementation of a parallel multiplier, divides the partial products into two groups, and controls the decoder with the different codes produced by counter 1 and counter 2, which are driven by the two clocks. On each rising clock edge, the corresponding operands are fed in turn into the K-bit multiplication unit, and the multiplication and shift-add operations are carried out to obtain the final product.

The invention adopts a pair of complementary clocks with the same frequency and opposite phases, so that the data change of the counter 1 and the counter 2 controlled by the two clocks in the figure 2 can be separated by half a period, and the data change can not generate interference when being input to a decoder. The decoder selects proper preceding stage data to be sent to the K bit multiplication unit in real time according to the counter 1 and the counter 2, and the data sent to the K bit multiplication unit every time are separated by half clock period. In FIG. 2, if the first stage CLA adder has data overflow, the data overflow handling module is called. The overflow processing module sends the low-K bits of the result of the CLA adder to the decoder, and the highest bit of the result is sent to the data operation module, so that the multiplication result is correct. The data operation module sends the processed data to the second-stage CLA adder, and the second-stage CLA adder outputs a correct operation result under the control of the counter 1.

The high-speed large-bit-width multiplier of the invention uses the Karatsuba algorithm to reduce the number of partial products in large-bit-width multiplication. The specific implementation is as follows: the two n-bit inputs x and y are each divided into m k-bit numbers, where m > 0 and k > 0. Let $X = \{a_{m-1} a_{m-2} \dots a_1 a_0\}$ and $Y = \{b_{m-1} b_{m-2} \dots b_1 b_0\}$. Then:

$$X \cdot Y = a_0 b_0 + (a_1 b_0 + a_0 b_1)\,2^{k} + \dots + 2^{2(m-1)k}\,a_{m-1} b_{m-1} \tag{1}$$

where, according to the Karatsuba algorithm:

$$a_1 b_0 + a_0 b_1 = (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1 \tag{2}$$

Substituting equation (2) into equation (1) gives:

$$X \cdot Y = a_0 b_0 + \left[(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1\right] 2^{k} + \dots + 2^{2(m-1)k}\,a_{m-1} b_{m-1} \tag{3}$$

so that the number of partial products is reduced from $m^2$ to $\frac{m(m+1)}{2}$.

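In this example the two input operands X and Y are 256 bits wide and the output Q is 512 bits wide. Each input is divided into four 64-bit numbers, namely $X = \{a_3 a_2 a_1 a_0\}$ and $Y = \{b_3 b_2 b_1 b_0\}$, so that:

$$\begin{aligned} Q = X \cdot Y = {} & a_0 b_0 + (a_1 b_0 + a_0 b_1)\,2^{64} + (a_1 b_1 + a_2 b_0 + a_0 b_2)\,2^{128} \\ & + (a_3 b_0 + a_0 b_3 + a_1 b_2 + a_2 b_1)\,2^{192} + (a_3 b_1 + a_1 b_3 + a_2 b_2)\,2^{256} \\ & + (a_3 b_2 + a_2 b_3)\,2^{320} + 2^{384}\,a_3 b_3 \end{aligned} \tag{4}$$

After substitution according to the Karatsuba algorithm, the expression becomes:

$$\begin{aligned} Q = {} & a_0 b_0 + \left[(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1\right] 2^{64} \\ & + \left[a_1 b_1 + (a_0 + a_2)(b_0 + b_2) - a_0 b_0 - a_2 b_2\right] 2^{128} \\ & + \left[(a_0 + a_3)(b_0 + b_3) - a_3 b_3 - a_0 b_0 + (a_1 + a_2)(b_1 + b_2) - a_1 b_1 - a_2 b_2\right] 2^{192} \\ & + \left[(a_1 + a_3)(b_1 + b_3) - a_1 b_1 - a_3 b_3 + a_2 b_2\right] 2^{256} \\ & + \left[(a_3 + a_2)(b_3 + b_2) - a_3 b_3 - a_2 b_2\right] 2^{320} + 2^{384}\,a_3 b_3 \end{aligned} \tag{5}$$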

Comparing the two expressions shows that, because each distinct product needs to be computed only once, the number of partial products falls from 16 to 10; the reduction becomes larger as the operand bit width increases further.
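The count of 10 products and the correctness of the substituted expression can be checked numerically. The following Python sketch is a software model only, not the patented circuit; the names `split` and `multiply_eq5` are assumptions introduced for the illustration.

```python
import random
from itertools import combinations

K = 64      # segment width in bits
M = 4       # number of segments per operand
MASK = (1 << K) - 1

def split(value: int) -> list[int]:
    """Split a 256-bit value into four 64-bit segments, least significant first."""
    return [(value >> (K * i)) & MASK for i in range(M)]

def multiply_eq5(x: int, y: int) -> tuple[int, int]:
    """Return (product, number of segment multiplications) following equation (5)."""
    a, b = split(x), split(y)
    diag = [a[i] * b[i] for i in range(M)]              # a_i * b_i: 4 products
    cross = {(i, j): (a[i] + a[j]) * (b[i] + b[j])      # products of sums: 6 products
             for i, j in combinations(range(M), 2)}
    q = 0
    for i in range(M):
        q += diag[i] << (2 * i * K)                     # a_i * b_i has weight 2^(2*i*K)
    for (i, j), prod in cross.items():
        # a_i*b_j + a_j*b_i, recovered by subtraction, has weight 2^((i+j)*K)
        q += (prod - diag[i] - diag[j]) << ((i + j) * K)
    return q, len(diag) + len(cross)

random.seed(0)
x, y = random.getrandbits(256), random.getrandbits(256)
q, count = multiply_eq5(x, y)
```

Each of the four products `diag[i]` is computed once and reused in several coefficients, which is where the saving over the 16 products of equation (4) comes from.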

Based on the above principle, this example designs the multiplier as follows. According to FIG. 1, when the input enable signal req_valid is invalid, the output is always 0. When req_valid is valid, the eight 64-bit numbers obtained by dividing the two operands are loaded into 8 registers. The CLA adder then computes a₀+a₁, a₀+a₂, a₀+a₃, a₁+a₂, a₁+a₃, a₂+a₃, b₀+b₁, b₀+b₂, b₀+b₃, b₁+b₂, b₁+b₃ and b₂+b₃, and the results are loaded into another 8 registers. The bottom-level multiplication unit used in this example is a 64-bit pipelined multiplication unit, and the additions above are 64-bit additions whose results may reach 65 bits, which constitutes a data overflow. A data-overflow check is therefore performed at this point, and overflowing data enter the overflow processing module. The module works on the same principle as the multiplier: the 65-bit data is divided into 1 bit and 64 bits for multiplication; once all operands are 64 bits wide they proceed to the next stage, and the remaining operation is completed by shifting.
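One way to read the 1-bit plus 64-bit split is sketched below in Python. The patent states only that the 65-bit data is divided into 1 bit and 64 bits and that the rest is done by shifting; the exact datapath, the function name `mul65` and the order of the additions are assumptions.

```python
K = 64
MASK = (1 << K) - 1

def mul65(s: int, t: int) -> int:
    """Multiply two operands of up to 65 bits using one 64x64 product plus shifts."""
    s_hi, s_lo = s >> K, s & MASK   # s_hi is the single overflow bit (0 or 1)
    t_hi, t_lo = t >> K, t & MASK
    product = s_lo * t_lo           # the only true multiplication, 64 x 64 bits
    if s_hi:
        product += t_lo << K        # 1 * t_lo * 2^64, done by shifting
    if t_hi:
        product += s_lo << K        # 1 * s_lo * 2^64, done by shifting
    if s_hi and t_hi:
        product += 1 << (2 * K)     # 1 * 1 * 2^128
    return product
```

In this reading, the low 64 bits go to the 64-bit multiplication unit while the overflow bit only selects which shifted copies are added afterwards.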

The decoder is controlled by the pair of complementary clocks clk1 and clk2 in the multiplier of this example, which drive counter 1 and counter 2 respectively to produce different codes. On each rising clock edge, the corresponding 64-bit operands are fed in turn into the multiplication unit to complete the multiplication. Because the coefficient multiplications of formula (5) are independent of one another and do not interfere, clk1 and clk2 carry out the operations from the head and from the tail respectively, which doubles the operation speed, improves clock efficiency and reduces resource consumption. Finally, the data output by the two 64-bit multiplication units pass through the data operation unit and the CLA adder, and the correct data are output when the output enable is valid.
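The head-and-tail processing can be modelled in software as two accumulators that consume the coefficient terms from opposite ends. This Python sketch makes several assumptions that the patent does not state: the split of the seven terms into a head group (weights 0 to 3) and a tail group (weights 6 down to 4), the strict alternation of the two clocks, and the direct computation of each coefficient without the Karatsuba substitution, which is omitted to keep the sketch short. It models ordering only and makes no claim about cycle counts.

```python
from itertools import zip_longest

K, M = 64, 4
MASK = (1 << K) - 1

def coefficient(a: list[int], b: list[int], w: int) -> int:
    """Coefficient of 2^(w*K) in X*Y: the sum of a_i*b_j over all i + j == w."""
    return sum(a[i] * b[w - i] for i in range(M) if 0 <= w - i < M)

def two_ended_multiply(x: int, y: int):
    a = [(x >> (K * i)) & MASK for i in range(M)]
    b = [(y >> (K * i)) & MASK for i in range(M)]
    weights = list(range(2 * M - 1))             # 0 .. 6, one per coefficient term
    half = (len(weights) + 1) // 2
    head = weights[:half]                        # 0, 1, 2, 3: handled on clk1 edges
    tail = weights[half:][::-1]                  # 6, 5, 4:    handled on clk2 edges
    acc1 = acc2 = 0
    order = []
    for w1, w2 in zip_longest(head, tail):
        if w1 is not None:                       # rising edge of clk1
            acc1 += coefficient(a, b, w1) << (w1 * K)
            order.append(("clk1", w1))
        if w2 is not None:                       # rising edge of clk2, half a period later
            acc2 += coefficient(a, b, w2) << (w2 * K)
            order.append(("clk2", w2))
    return acc1 + acc2, order                    # a final addition merges both groups
```

Because each term carries its own power-of-two weight, the two accumulators never depend on each other, and the final sum is the same whatever order the terms are processed in.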

The simulation result of the multiplier of the invention is as follows:

1. the experimental environment is as follows:

The multiplier of this embodiment is coded in Verilog HDL, verified by simulation in vcs_vM-2017.03, synthesized in a 55 nm CMOS process with the synthesis tool DC-2014, and automatically placed and routed with INNOVUS.

Three groups of experimental data were selected at random for pre-layout simulation, and the simulation results were correct. Static timing analysis with DC-2014 shows that the whole chip works correctly with a 100 MHz clock.

To ensure the reliability of the chip, INNOVUS was used to extract the signal delays caused by the standard cells and the interconnect in the layout, and the same three groups of data were used for post-layout simulation. All results were verified to be correct.

Comparative experiment:

For a fair comparison with the conventional parallel multiplier, the invention uses only two 64-bit multiplication units.

2. Results of the experiment

Item	Multiplier of the present example	Conventional parallel multiplier
Operand bit width	256 bits	256 bits
Number of bottom-level multiplication units	two 64-bit units	two 64-bit units
Process	55 nm CMOS	55 nm CMOS
Operation speed	2.5 clock cycles	8 clock cycles
Number of logic gates	70,000	75,000

The simulation results show that the multiplier of this embodiment outputs the correct result in 2.5 clock cycles, whereas the conventional parallel multiplier needs 8 clock cycles. The multiplier of this embodiment is therefore about 3 times as fast as the conventional parallel multiplier (8 / 2.5 = 3.2); compared with other types of multiplier it is faster, has a simpler circuit and occupies fewer resources.

Claims

1. A high-speed big bit width multiplier is characterized by comprising two complementary clocks, a CLA adder, an overflow processing module, a decoder, a K bit multiplying unit and a data operation module;

the CLA adder is used for adding each part of the divided multipliers pairwise;

the overflow processing module is used for judging the result of the first-stage CLA adder, and when the bit width of the result is more than K bits, the output result of the module is sent to a decoder and a data operation module at the lower stage under the control of a counter;

the decoder selects the data sent from the front stage by decoding the code of the counter, stores the data in a corresponding register and then sends the data to the K bit multiplication unit for operation;

the K bit multiplying unit adopts a partial product generator with two multipliers of K bits and an output result of 2 x K bits;

the data operation module carries out shift addition on the operation result of the K bit multiplication unit under the control of the counter, the output result enters the second-stage CLA adder for final calculation, and the second-stage CLA adder outputs the correct operation result under the control of the counter.

2. The circuit structure of a high-speed large-bit-width multiplier according to claim 1, wherein said overflow handling module sends the output result of the module to the decoder and data operation module at the lower stage by: the low K bits of the result of the CLA adder are sent to a decoder, and the highest bit of the result is sent to a data operation module.

3. A high-speed large-bit-width multiplier according to claim 1, wherein the operation method of said multiplier is: the Karatsuba algorithm is applied to the hardware of a parallel multiplier, the partial products are divided into two groups, and the two complementary clocks drive their respective counters to produce different codes that control the decoder; on each rising clock edge, the corresponding operands are fed in turn into the K-bit multiplication unit, and the multiplication and shift-add operations are carried out to obtain the final product.

4. A high-speed large-bit-width multiplier according to claim 3, wherein the karatsuba algorithm is applied to the hardware of the parallel multiplier, and can be used to reduce the partial product during the large-bit-width multiplication, and the specific method is as follows:

dividing each of the two n-bit inputs x and y into m k-bit numbers, wherein m is greater than 0 and k is greater than 0;

letting $X = \{a_{m-1} a_{m-2} \dots a_1 a_0\}$ and $Y = \{b_{m-1} b_{m-2} \dots b_1 b_0\}$,

then:

$$X \cdot Y = a_0 b_0 + (a_1 b_0 + a_0 b_1)\,2^{k} + \dots + 2^{2(m-1)k}\,a_{m-1} b_{m-1} \tag{1}$$

wherein, according to the Karatsuba algorithm:

$$a_1 b_0 + a_0 b_1 = (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1 \tag{2}$$

the number of partial products is reduced from $m^2$ to $\frac{m(m+1)}{2}$;

because the coefficient terms of formula (3) taken from the first term onward and those taken from the last term backward do not affect one another, the coefficient multiplications can be carried out from the head and from the tail simultaneously.