CN101458617A

CN101458617A - 32-bit integer multiplier based on a CISC microprocessor

Info

Publication number: CN101458617A
Application number: CNA2008101759220A
Authority: CN
Inventors: 高德远; 王党辉; 王得利; 樊晓桠; 张盛兵; 黄小平; 魏廷存; 张萌
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2008-01-22
Filing date: 2008-10-29
Publication date: 2009-06-17
Anticipated expiration: 2028-10-29
Also published as: CN100595729C

Abstract

The invention discloses a 32-bit integer multiplier in the field of computer microprocessor design. The multiplier comprises 4-2 compressors, characterized in that the 4-2 compressors form a three-level 4-2 compressor array. The multiplier can perform signed or unsigned 32-bit multiplication: after the multiplicand is sign-extended, radix-4 Booth encoding is applied and 16 partial products are generated from the multiplicand register. A three-stage pipeline is adopted and the result is returned in parts over a 32-bit result bus: the second beat returns the low 32 bits and the third beat returns the high 32 bits. One multiplication is controlled and completed by three micro-instructions or by two micro-instructions. With the three-level 4-2 compressor array, micro-instructions control the multiplier so that it can serve multiply operations whose results are needed at different times. The number of radix-4 Booth partial products for sign-extended signed or unsigned 32-bit operands is reduced from 17 to 16, which simplifies the multiplier structure and reduces the multiplication delay.

Description

32-bit integer multiplier based on a CISC microprocessor

Technical field

The present invention relates to a 32-bit integer multiplier based on a CISC microprocessor.

Background technology

Refer to Fig. 6. The Intel x86 instruction set has two kinds of multiply instruction: signed multiply and unsigned multiply. The most significant bit of the multiplier and of the multiplicand may therefore be either a sign bit or an ordinary data bit. To improve hardware utilization, multipliers for Intel-compatible CISC (Complex Instruction Set Computer) microprocessors usually take the form of a mixed signed/unsigned multiplier, in which both operands are handled in two's-complement form.

Suppose a multiplier A and a multiplicand B, both in two's-complement representation, are multiplied. Let the bit width of A be N, i.e. A[N-1:0], and let N be even. Then A can be expressed as:

A = -A_{N-1} \times 2^{N-1} + A_{N-2} \times 2^{N-2} + \dots + A_1 \times 2^1 + A_0 \times 2^0

= (-A_{N-1} + A_{N-2}) \times 2^{N-1} + (-A_{N-2} + A_{N-3}) \times 2^{N-2} + \dots + (-A_2 + A_1) \times 2^2 + (-A_1 + A_0) \times 2^1 + (-A_0 + 0) \times 2^0

= (-2A_{N-1} + A_{N-2} + A_{N-3}) \times 2^{N-2} + (-2A_{N-3} + A_{N-4} + A_{N-5}) \times 2^{N-4} + \dots + (-2A_3 + A_2 + A_1) \times 2^2 + (-2A_1 + A_0 + 0) \times 2^0

The product of A and B can therefore be expressed as

A \times B = \left( \sum_{i=0}^{N/2-1} (-2A_{2i+1} + A_{2i} + A_{2i-1}) \times 2^{2i} \right) \times B, \quad A_{-1} = 0

= \sum_{i=0}^{N/2-1} \left( (-2A_{2i+1} + A_{2i} + A_{2i-1}) \times B \right) \times 2^{2i}, \quad A_{-1} = 0

The term (-2A_{2i+1} + A_{2i} + A_{2i-1}) × B can be preprocessed so that the multiplication becomes the addition of a few simple numbers. Preprocessing the multiplicand yields four useful values: the multiplicand itself, twice the multiplicand, the inverted multiplicand plus 1, and the inverted doubled multiplicand plus 1. The operation applied to the multiplicand is selected from the following table according to the values of A_{2i+1}, A_{2i} and A_{2i-1}. Here 2B means shifting the multiplicand left by one bit, -B means inverting the multiplicand, and -2B means shifting the multiplicand left by one bit and then inverting it. When B is multiplied by a negative value, B is first multiplied by the corresponding magnitude, then inverted, and then 1 is added. E denotes this added 1.

A_{2i+1}	A_{2i}	A_{2i-1}	Operation on B	E_i
0	0	0	0·B	0
0	0	1	+1·B	0
0	1	0	+1·B	0
0	1	1	+2·B	0
1	0	0	-2·B	+1
1	0	1	-1·B	+1
1	1	0	-1·B	+1
1	1	1	0·B	0

The above is the principle of the radix-4 Booth algorithm. Each term ((-2A_{2i+1} + A_{2i} + A_{2i-1}) × B × 2^{2i}) is one generated partial product. It follows from the derivation that multiplying two operands of even bit width N generates N/2 partial products.
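The recoding derived above can be checked numerically. The following is a minimal Python sketch, not the patent's hardware: the function names `booth_radix4_digits`, `to_signed` and `booth_multiply` are hypothetical, and the operand is passed as an N-bit pattern with N even.

```python
def booth_radix4_digits(a: int, n: int) -> list[int]:
    """Radix-4 Booth digits (-2*A[2i+1] + A[2i] + A[2i-1]) of an n-bit pattern."""
    if n % 2 != 0:
        raise ValueError("bit width must be even")
    bits = [(a >> k) & 1 for k in range(n)]
    digits = []
    for i in range(n // 2):
        a_hi = bits[2 * i + 1]
        a_mid = bits[2 * i]
        a_lo = bits[2 * i - 1] if i > 0 else 0  # A_{-1} is defined as 0
        digits.append(-2 * a_hi + a_mid + a_lo)
    return digits


def to_signed(a: int, n: int) -> int:
    """Interpret the low n bits of a as a two's-complement integer."""
    a &= (1 << n) - 1
    return a - (1 << n) if a >> (n - 1) else a


def booth_multiply(a: int, b: int, n: int) -> int:
    """Sum the n/2 partial products digit_i * B * 2**(2i)."""
    return sum((d * b) << (2 * i) for i, d in enumerate(booth_radix4_digits(a, n)))
```

Every digit lies in the set {-2, -1, 0, +1, +2}, which matches the five operations listed in the table, and an 8-bit operand yields exactly four digits.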

The current general practice for a 32-bit mixed signed/unsigned multiplier is as follows. The operands are first sign-extended into two 33-bit two's-complement operands. Because Booth encoding requires an even bit width, a two-bit sign extension is needed in practice, giving 34 bits. Radix-4 Booth encoding then generates 17 partial products, and compressors such as 4-2 compressors sum the partial products down to two operands. With 4-2 compressors, ⌈log₂(k/2)⌉ compression levels are needed to obtain the two final values, where k is the number of partial products and ⌈x⌉ denotes the smallest integer greater than or equal to x. 17 partial products therefore need 4 levels of 4-2 compression. The two resulting operands are then added to obtain the final 64-bit product. Fig. 6 shows the structure of a multiplier designed in this way.
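The level count quoted above can be checked by simulating the reduction row by row. This is a rough sketch under a stated assumption: each level turns every full group of four rows into two, three leftover rows also pass through a partly used compressor, and one or two leftover rows are simply forwarded. The function name `compressor_levels` is hypothetical.

```python
def compressor_levels(k: int) -> int:
    """Count the 4-2 compression levels needed to reduce k rows to 2."""
    levels = 0
    while k > 2:
        full, rest = divmod(k, 4)
        # A full group of 4 rows becomes 2 rows. Three leftover rows use a
        # compressor with one idle input; 1 or 2 leftover rows are forwarded.
        k = 2 * full + (2 if rest == 3 else rest)
        levels += 1
    return levels
```

Under this model 17 rows shrink as 17, 9, 5, 3, 2, whereas 16 rows shrink as 16, 8, 4, 2 with every compressor fully used.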

A multiplier designed in this way has several shortcomings. 1. Radix-4 Booth encoding requires the multiplier operand to have an even width, so the 32-bit operand is extended to 34 bits and 17 partial products are generated. In the subsequent 4-2 compression many compressors then have incomplete inputs, which wastes considerable area, and because four levels of 4-2 compression are needed, the operation time is also longer. 2. The 64-bit result is produced all at once, so writing it back to the registers requires a 64-bit common result bus. However, the result of a single operation of most x86 instructions is at most 32 bits, and doubling the bus width for multiplication alone is not worthwhile. Moreover, writing back a 64-bit result in one go increases the burden on components such as the register file and the instruction dispatch tracking logic: for example, the register file must support multi-port writes, the data dependency list needs more entries, and the register bypass logic becomes more complex. 3. Different x86 multiply instructions place different requirements on the result. Instructions of the IMUL r/m32 class require the 64-bit product to be written back as a high 32-bit half and a low 32-bit half to two different registers, whereas instructions of the IMUL r32, r/m32 class require only the low 32 bits of the 64-bit product to be written back to a register, with the high 32 bits discarded. The multiplier should therefore end the multiply operation at the appropriate time, depending on the instruction, and return the correct result. A multiplier designed by the conventional method adapts poorly in this respect.

Summary of the invention

To overcome the above shortcomings of the prior art, the invention provides a 32-bit integer multiplier based on a CISC microprocessor. The multiplier needs few clock cycles, occupies little area, has uncomplicated result dependencies and is highly reusable.

The technical solution adopted by the invention is a 32-bit integer multiplier based on a CISC microprocessor, comprising 4-2 compressors, characterized in that the 4-2 compressors form a three-level 4-2 compressor array; the multiplier can perform signed or unsigned 32-bit multiplication; after the multiplicand is sign-extended, radix-4 Booth encoding is used and 16 partial products are generated from the multiplicand register;

The multiplier adopts a three-stage pipeline and returns the result in parts: the second beat returns the low 32 bits of the result and the third beat returns the high 32 bits, over a 32-bit result bus;

One multiplication is controlled and completed by three micro-instructions or by two micro-instructions.

The beneficial effects of the invention are as follows. The multiplier is designed from a system-level viewpoint, which avoids the mismatch with the architecture that arises when an execution unit is designed from its function alone. The designed multiplier places little pressure on the architectural registers, on the reservation station and tag checking, and on the width of the common result bus. Once the micro-instructions are fixed, a three-stage pipeline is used to satisfy multiply operations whose results are needed at different times. In addition, the number of radix-4 Booth partial products for sign-extended signed or unsigned 32-bit operands is reduced from 17 to 16, which both simplifies the multiplier structure and effectively reduces the multiplication delay. The resulting multiplier is compact, occupies less area than prior-art multipliers, has uncomplicated result dependencies and is highly reusable.

The invention is described in detail below with reference to the drawings and an embodiment.

Description of drawings

Fig. 1 is a structural diagram of the 32-bit integer multiplier based on a CISC microprocessor according to the invention.

Fig. 2 is a schematic diagram of the original form of the 16 partial products generated by the partial product generator in Fig. 1.

Fig. 3 is a schematic diagram of the sign-simplified form of the 16 partial products generated by the partial product generator in Fig. 1.

Fig. 4 is a structural diagram of the 4-2 compressor in Fig. 1.

Fig. 5 is a diagram of the 4-2 compressor array in Fig. 1.

Fig. 6 is a structural diagram of the background-art multiplier.

Embodiment

Refer to Figs. 1 to 5. Consider the properties of Intel instructions. Their general form is OPcode A B: the operation is applied to the two source operands A and B and the result is placed in A, so A serves both as a source address and as the destination address. To keep the form as uniform as possible, micro-instructions of CISC microprocessors usually adopt the same form: microcode A B. For an addition micro-operation, for example, A and B are added and the result is written back to A. To make full use of resources, the multiplier should handle both signed and unsigned multiplication. The micro-instruction set should therefore contain at least an unsigned multiply, mul, and a signed multiply, imul. For single-operand instructions, the implicit operand EAX can be named explicitly in the micro-instruction. However, the result is 64 bits wide and must be written back to two different 32-bit registers, and if the write-back register is determined only from the destination register number, only one part of the result can be written back. The alternative is to make the register file dual-ported for writes and to decode the current operation again at write-back, typically with logic such as: if (microcode == mul), then EAX ← Mul_result[31:0], EDX ← Mul_result[63:32]. This makes the register write-back logic irregular and increases its burden. The single-operand form of signed multiplication can be handled in the same way. For the two-operand and three-operand signed forms, however, this approach does not work, because their result consists of the low 32 bits only and the high 32 bits are discarded. If the whole result had to be written back at once, the two micro-instructions mul and imul would not be enough; at least one further micro-instruction, imult, would have to be added to cover the case where only the low 32 bits of the result are kept.

A typical microprocessor uses a pipelined organization whose main stages are instruction fetch, decode, operand fetch, execute and write-back. Pipeline stalls caused by data dependencies during instruction execution greatly affect pipeline performance. Read-after-write data dependencies can occur in an in-order pipeline. Some 32-bit multiplications produce two 32-bit results in one operation. If both results are written back at once, the bypass design requires that, during micro-instruction decode, both operands of the next micro-instruction be compared with both EDX and EAX to determine whether a data dependency exists. Furthermore, if register renaming is used to support out-of-order execution, checking the architectural registers, reservation station and tags becomes more complex, and the common result bus becomes wider while its utilization drops. All of these increase timing pressure. It is therefore worth writing the two results back at different times: writing back one result at a time makes the register write-back and bypass logic much simpler than writing back two results at once.

The above analysis shows that writing back two results at once should be avoided. The multiplier should therefore write the result back in two parts, which means a multiply instruction needs at least two beats. However, there are multiply instructions such as IMUL A, B, C, where the product of B and C is written not to B but to A. In the first beat the two operands B and C must be specified, and since a micro-instruction of the form microcode A B has only two operand fields, the destination A cannot be specified in that beat. At least one more beat is therefore needed, used solely to specify the two operands, so a multiply instruction takes at least three beats. This essentially fixes the content of the micro-instructions: the first beat reads the two operands, the second beat can write back the low 32 bits of the result, and the third beat writes back the high 32 bits. In the hardware this corresponds to 160. In addition, other complex instructions also use the multiply micro-operation. Statistics show that it takes the form mul A, B, where the product of A and B is placed in A and only the low 32 bits are needed, so only two micro-instructions are required. A multiply operation should therefore have some flexibility so that results can be delivered at different times, and controlling the stages of the multiplication with a state machine is not well suited to this. Instead, only the micro-instructions are used to control the different stages of the multiplication, which requires that the beat to which a micro-instruction belongs be easy to identify. The micro-instruction format contains a write-back bit wb. In the first beat of the pipeline wb is 0, indicating that no result is written back. The second and third beats write results back, so wb is 1. Since a multiply instruction must also update the flags register when it completes, the flags-update bit flag can be used: flag is 0 in the second beat and 1 in the third beat, which distinguishes these two execution stages. With this method, all types of multiply instruction can be expressed with just the two microcodes MUL and IMUL by filling in the correct register numbers in the operand fields, which avoids any need to widen the micro-instruction.

The micro-instructions for a multiplication are therefore as follows, taking MUL EBX as an example:

mul EAX, EBX    no wb, no flag

mul EAX, EBX    wb, no flag

mul EDX, EBX    wb, flag
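The way the wb and flag bits select the beat can be illustrated with a small behavioural model. This is only a sketch under assumptions: the `MicroOp` class, the `run_multiply` function and the dictionary register file are hypothetical names, and the flag update is reduced to a boolean marker rather than modelling real flag values.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


@dataclass
class MicroOp:
    opcode: str  # "mul" (unsigned) or "imul" (signed)
    dst: str     # first operand field, also the write-back destination
    src: str     # second operand field
    wb: bool     # write-back bit
    flag: bool   # flags-update bit


def to_signed32(x: int) -> int:
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x


def run_multiply(uops: list[MicroOp], regs: dict[str, int]) -> bool:
    """Run a 2- or 3-micro-op multiply; return True if flags were updated."""
    product = 0
    flags_updated = False
    for u in uops:
        if not u.wb:                 # beat 1: read the two operands only
            a, b = regs[u.dst], regs[u.src]
            if u.opcode == "imul":
                a, b = to_signed32(a), to_signed32(b)
            product = (a * b) & MASK64
        elif not u.flag:             # beat 2: write back the low 32 bits
            regs[u.dst] = product & MASK32
        else:                        # beat 3: write back the high 32 bits
            regs[u.dst] = product >> 32
            flags_updated = True     # stand-in for the flags register update
    return flags_updated
```

Passing only the first two micro-operations models the two-micro-instruction form, in which the high half is never written back.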

The above analysis shows that completing a 32-bit multiplication in three beats is optimal. Both source operands, signed or unsigned, are extended to 33 bits with a sign bit, so every case is handled uniformly as a signed multiplication of two 33-bit operands. Radix-4 Booth encoding controls the generation of the partial products, and a tree of 4-2 compressors then sums them. The two operands that finally result are summed with an adder.

Booth encoding has one precondition: the bit width of the multiplier A must be even. After signed and unsigned multiplication are unified, however, the sign-extended multiplier A is 33 bits wide, so A must be extended by one more bit to 34 bits. This gives 17 partial products. After the partial products are obtained, 4-2 compressors can be used to speed up the summation, which calls for the number of partial products to be a multiple of 4 where possible. The number of partial products therefore needs to be reduced to 16. Analysing the Booth encoding, if A is extended to 34 bits, then for IMUL the bits A33, A32, A31 can take only two patterns: all 0 or all 1. The table above shows that the partial product is 0 in both cases. For MUL, A33, A32, A31 can likewise take only two patterns: 000 or 001. The table shows that the partial product is 0 for 000, and that only for 001 is the partial product the multiplicand itself. Because i = 16 here, this partial product affects only the high 32 bits of the final product. Since the low 32 bits are written back in the second beat and the high 32 bits in the third beat, this term can be added to the high 32 bits of the final result in the last beat. This guarantees that only 16 partial products are generated, reduces the number of 4-2 compressors, and relaxes the timing pressure in the first two beats. At the same time, only one bit of sign extension is needed. In this way the 64-bit adder can be split into two 32-bit adders, and the high 32-bit adder can be time-multiplexed. Sign extension corresponds to 110, partial product generation corresponds to 120, and the high and low 32-bit additions correspond to 140 and 150.
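The argument about the seventeenth partial product can be checked in software. The sketch below is an arithmetic model only, with hypothetical names `digit17` and `multiply32`: it forms 16 Booth partial products from the 32-bit multiplier and then applies the late correction described above.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def digit17(a_bits: int, signed: bool) -> int:
    """Booth digit for i = 16, formed from A33, A32, A31 of the extended A."""
    a31 = (a_bits >> 31) & 1
    ext = a31 if signed else 0  # A33 = A32 = extension bit
    return -2 * ext + ext + a31


def multiply32(a_bits: int, b_bits: int, signed: bool) -> int:
    """64-bit product pattern from 16 partial products plus a late correction."""
    b_val = b_bits & MASK32
    if signed and b_val >> 31:
        b_val -= 1 << 32        # value of the sign-extended multiplicand
    total = 0
    for i in range(16):
        hi = (a_bits >> (2 * i + 1)) & 1
        mid = (a_bits >> (2 * i)) & 1
        lo = (a_bits >> (2 * i - 1)) & 1 if i > 0 else 0
        total += ((-2 * hi + mid + lo) * b_val) << (2 * i)
    # Seventeenth term: nonzero only for unsigned operands with A31 = 1,
    # and it only touches bits 32 and above.
    total += (digit17(a_bits, signed) * b_val) << 32
    return total & MASK64
```

The sixteen digits on their own encode the multiplier as a signed 32-bit value, so the correction of +B × 2^32 is exactly what turns it into the unsigned value when A31 is 1.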

Each partial product has a corresponding E_i. Adding E_i directly to its own partial product would complicate the computation, so the value is instead placed at the tail of the next partial product. Fig. 2 shows the partial products obtained after Booth encoding, where S_i is the sign extension of the partial product: if the partial product is positive, S_i is 1, otherwise it is 0.

According to the conversion:

where

\hat{S} \oplus S = 1

After this simplification, power consumption and area are effectively reduced.

Once the 16 partial products have been generated, they must all be summed, and a faster structure sums them in parallel with multiple 4-2 compressors. The 4-2 compressor has a regular structure that suits VLSI implementation. The 4-2 compressor is shown in Fig. 5, where L1, L2, L3 and L4 are the four input bits, CIN is the carry from the previous bit position, the output S is the sum bit, COUT is the carry passed to the next bit position, and CARRY is the second output of the compressor. In Fig. 5, COUT is generated two gate delays after the input bits become valid, and CIN is not needed until two gate delays have passed, so many groups of four addends can be added simultaneously. Each group of four partial products passes through a row of 4-2 compressors to give two outputs. After the first level of 4-2 compression there are 8 outputs, after the second level there are 4, and after the third level there are 2.
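The behaviour of a 4-2 compressor and of the three-level reduction can be sketched in Python. The gate-level structure in the patent's figures is not reproduced in the text, so this sketch assumes the common construction from two chained full adders. The names `full_adder`, `compressor_4_2`, `compress_rows` and `reduce_tree` are hypothetical, and the row-level functions assume non-negative integers and a row count that stays a multiple of four until two rows remain.

```python
def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry)."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def compressor_4_2(l1: int, l2: int, l3: int, l4: int, cin: int):
    """One-bit 4-2 compressor: returns (S, CARRY, COUT)."""
    s1, cout = full_adder(l1, l2, l3)   # COUT does not depend on CIN
    s, carry = full_adder(s1, l4, cin)
    return s, carry, cout


def compress_rows(r1: int, r2: int, r3: int, r4: int) -> tuple[int, int]:
    """Word-level 4-2 compression: r1 + r2 + r3 + r4 == s + c."""
    s1 = r1 ^ r2 ^ r3
    c1 = ((r1 & r2) | (r1 & r3) | (r2 & r3)) << 1
    s = s1 ^ r4 ^ c1
    c = ((s1 & r4) | (s1 & c1) | (r4 & c1)) << 1
    return s, c


def reduce_tree(rows: list[int]) -> tuple[list[int], int]:
    """Reduce rows to two with levels of 4-2 compression; return (rows, levels)."""
    levels = 0
    while len(rows) > 2:
        nxt = []
        for j in range(0, len(rows), 4):
            nxt.extend(compress_rows(*rows[j:j + 4]))
        rows = nxt
        levels += 1
    return rows, levels
```

In the one-bit model the five inputs always satisfy L1 + L2 + L3 + L4 + CIN = S + 2 × (CARRY + COUT), and because COUT is computed without CIN, there is no carry ripple along a row of compressors.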

Because the multiplication of two 32-bit operands must be completed in three beats, the logic performed in the three cycles must be distributed sensibly, subject to the precondition that the low 32 bits are computed in the second beat and the high 32 bits in the third beat. Since the main delay is the carry from the low-order bits to the high-order bits, the low 32 bits of the result are bound to be ready earlier than the high 32 bits. One possible arrangement is as follows: the first-stage register is placed after the second-level 4-2 compressors, and the second-stage register is placed after the carry-select adder that adds the two outputs produced by the third-level 4-2 compression. In the first cycle of the multiplication, sign extension, partial product generation and the first two levels of 4-2 compression are completed. The second beat completes the third level of 4-2 compression and produces the 64-bit product. The third beat decides whether the multiplicand must be added to the high 32 bits to obtain the final product, and sets the corresponding flag bits according to the product. An enable signal is placed in front of each stage of logic. The three enable signals are determined jointly by the microcode and the flag and wb bits, and decide whether the stage operates and whether the intermediate result is latched into the register.
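The division of work across the three beats can be modelled as three functions. This is a behavioural sketch under assumptions, not the patent's circuit: the names `csa_4_2`, `level`, `beat1`, `beat2` and `beat3` are hypothetical, all rows are kept modulo 2^64, and the enable signals and flag logic are omitted.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def csa_4_2(r1: int, r2: int, r3: int, r4: int) -> tuple[int, int]:
    """Word-level 4-2 compression; the sum is preserved modulo 2**64."""
    s1 = r1 ^ r2 ^ r3
    c1 = (((r1 & r2) | (r1 & r3) | (r2 & r3)) << 1) & MASK64
    s = s1 ^ r4 ^ c1
    c = (((s1 & r4) | (s1 & c1) | (r4 & c1)) << 1) & MASK64
    return s, c


def level(rows: list[int]) -> list[int]:
    """One compression level: every group of four rows becomes two."""
    out = []
    for j in range(0, len(rows), 4):
        out.extend(csa_4_2(*rows[j:j + 4]))
    return out


def beat1(a: int, b: int, signed: bool) -> list[int]:
    """Sign extension, 16 partial products, first two compression levels."""
    b_val = b - (1 << 32) if signed and b >> 31 else b
    rows = []
    for i in range(16):
        hi = (a >> (2 * i + 1)) & 1
        mid = (a >> (2 * i)) & 1
        lo = (a >> (2 * i - 1)) & 1 if i > 0 else 0
        rows.append((((-2 * hi + mid + lo) * b_val) << (2 * i)) & MASK64)
    return level(level(rows))  # 16 -> 8 -> 4 rows, held in stage register 1


def beat2(rows4: list[int]) -> tuple[int, int]:
    """Third compression level and addition; returns (low 32, high 32)."""
    s, c = level(rows4)
    product = (s + c) & MASK64
    return product & MASK32, product >> 32


def beat3(high: int, a: int, b: int, signed: bool) -> int:
    """Add the multiplicand to the high half if unsigned and A31 = 1."""
    if not signed and (a >> 31) & 1:
        high = (high + b) & MASK32
    return high
```

For example, `low, high = beat2(beat1(a, b, signed))` gives the half that is written back in the second beat, and `beat3(high, a, b, signed)` gives the half written back in the third beat.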

Claims

1. A 32-bit integer multiplier, comprising 4-2 compressors, characterized in that: the 4-2 compressors form a three-level 4-2 compressor array; the multiplier can perform signed or unsigned 32-bit multiplication; after the multiplicand is sign-extended, radix-4 Booth encoding is used and 16 partial products are generated from the multiplicand register;