CN102520906A - Vector dot product accumulating network supporting reconfigurable fixed floating point and configurable vector length - Google Patents

Vector dot product accumulating network supporting reconfigurable fixed floating point and configurable vector length

Info

Publication number
CN102520906A
CN102520906A CN2011104130015A CN201110413001A
Authority
CN
China
Prior art keywords
floating
point
index
result
mantissa
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104130015A
Other languages
Chinese (zh)
Inventor
王东琳
汪涛
尹磊祖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN2011104130015A priority Critical patent/CN102520906A/en
Publication of CN102520906A publication Critical patent/CN102520906A/en
Pending legal-status Critical Current

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a vector dot-product accumulation network that supports reconfigurable fixed/floating-point formats and a configurable vector length. The network comprises: a parallel reconfigurable multiplier that receives the vectors B and C and the options FBS and U as input and performs the vector multiplication to obtain the product B*C of the vectors B and C; a floating-point exponent and mantissa pre-processing part that receives the product B*C of the parallel reconfigurable multiplier and the scalar A as input and performs maximum-exponent selection, exponent subtraction, alignment shifting, complement conversion and sticky-bit compensation to obtain the processed vector result B*C and scalar result A; a reconfigurable compressor that receives the output of the pre-processing part and compresses it into a sum string S and a carry string C; and a floating-point exponent and mantissa post-processing/fixed-point operation part that receives the sum string S and the carry string C of the reconfigurable compressor, performs the mantissa addition and the post-processing, and produces the final vector dot-product accumulation result.

Description

Vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length
Technical field
The present invention relates to the technical field of high-performance digital signal processors, and in particular to a vector dot-product accumulation network that supports reconfigurable fixed/floating-point formats and a configurable vector length.
Background technology
In the field of digital processing, the digital signal processor (DSP) is the core of a modern system, and the performance of the DSP directly determines the performance of the whole system. Within a DSP, every computation, however complex, is ultimately carried out by the arithmetic unit, so the arithmetic unit is the core component of the DSP and its computing power is the leading indicator of DSP performance. In particular, with the continuous development of the technology, computation-intensive fields such as modern radar signal processing, satellite-borne image processing, image compression and HD video place ever higher demands on signal-processing capability. This poses an increasingly severe challenge to the arithmetic unit; the pressure brought by the variable data sizes and high-density parallel computation of these application domains is especially evident.
Digital signal processing makes heavy use of the "dot product" operation, for example in FFT, FIR filtering and signal correlation. All of these operations multiply the input signal element-wise with coefficients or local parameters and then accumulate the products, which amounts to taking the dot product of two vectors (or sequences). For the dot-product operation, mainstream DSPs currently have no dedicated instruction; it is generally carried out by sequences of multiply, accumulate or multiply-accumulate instructions. These techniques share several shortcomings:
1) Low hardware utilization. The dot-product operation only uses the scalar multiply, accumulate or multiply-accumulate resources and does not make reasonable use of the vector computation resources.
2) Weak processing capability. Generally only a 16/32-bit scalar dot product can be performed, so a vector dot product must be completed by many scalar dot-product operations; data throughput is low and efficiency is poor.
3) Long execution latency. One dot-product operation requires many multiply, accumulate or multiply-accumulate instructions to complete serially; data dependences exist between them, so a dot product needs many clock cycles.
4) Low flexibility. Only a specific data format or a specific length of dot product is supported, and the number of data elements participating in the dot product is not configurable.
5) Difficult programming. The dot product is composed of multiply, accumulate or multiply-accumulate operations that are not single-cycle instructions, so the data dependences between the instructions must be considered.
The dot-product operation belongs to the category of multi-operand computation, which requires careful analysis of the sign extension and precision of the data. Several existing patents discuss how to realize multi-operand computation. For example, patent application 201010162375.X, "Reconfigurable transverse summing network structure for supporting fixed and floating points", proposes a multi-operand addition, but it does not analyze the precision of the floating-point data format and the overall data length is not configurable. Patent application 201010535666.9, "Instruction and logic for performing a dot-product operation", proposes a scheme for executing a dot-product instruction with a configurable data format, but it remains a scalar dot product and the number of data elements participating in the operation is not configurable. Patent application 201010559300.5, "Multifunctional unit for a SIMD vector microprocessor", introduces a vector floating-point multiply-add unit, but it does not reconfigure fixed-point data and, likewise, the number of data elements participating in the multiply-add operation is not configurable.
Analyzing the dependences of the dot-product operation at the arithmetic level shows that a dot product is completed by multiplication followed by accumulation. The present invention therefore departs from the classic approach of executing the vector dot product on scalar operation units: it reuses the existing vector multiplication resources, adds accumulation network resources, and uses the vector operation units to execute the vector dot product, which fundamentally improves dot-product computing capability. At the same time, by analyzing the relationship between the floating-point and fixed-point data formats, the floating-point data are pre-processed so that the fixed-point datapath can be reused; different data formats and data granularities are supported through a reconfigurable compression technique; and a Mask register is used to configure the number of data elements participating in the dot product. Providing, in this way, a dot-product accumulation network that supports different granularities, different data formats and different vector lengths, so as to offer powerful dot-product capability for digital signal processing, is the problem that the present invention solves.
Summary of the invention
(1) Technical problem to be solved
In view of this, the main purpose of the present invention is to provide a dot-product accumulation network that supports reconfigurable fixed/floating point and a configurable vector length. It supports dot-product operations on 8/16/32-bit fixed-point data and on 32-bit simplified IEEE-754 standard single-precision floating-point data; the vector length participating in the dot product is flexibly configurable; the vector computation resources are used rationally to improve the efficiency and throughput of the dot product; and the programming complexity of dot-product software is reduced, satisfying the demand for powerful dot-product computation in digital signal processing.
(2) Technical scheme
To achieve the above object, the invention provides a vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length, comprising: a parallel reconfigurable multiplier 1, which receives the vector data B, C and the data options FBS, U as input, performs the vector multiplication to obtain the product B*C of the vector data B and C, and outputs it to the floating-point exponent and mantissa pre-processing part 2; the floating-point exponent and mantissa pre-processing part 2, which receives the product B*C of the parallel reconfigurable multiplier 1 and the scalar data A as input, performs maximum-exponent selection, exponent subtraction, alignment shifting, complement conversion and sticky-bit compensation, obtains the processed vector result B*C and scalar result A, and outputs them to the reconfigurable compressor part 3; the reconfigurable compressor part 3, which receives the result of the floating-point exponent and mantissa pre-processing part 2, compresses it to obtain a sum string S and a carry string C, and outputs them to the floating-point exponent and mantissa post-processing/fixed-point operation part 4; and the floating-point exponent and mantissa post-processing/fixed-point operation part 4, which receives the sum string S and the carry string C from the reconfigurable compressor part 3, performs the mantissa addition, and post-processes the mantissa addition result to obtain the final dot-product accumulation result.
(3) Beneficial effects
The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length provided by the invention pre-processes the floating-point data so that the fixed-point datapath is reused, supports different data formats and data granularities through a reconfigurable compression technique, and uses the Mask register to flexibly configure the vector length participating in the dot product. It supports dot products on 8/16/32-bit fixed-point data and on simplified IEEE-754 standard single-precision floating-point data; the vector length participating in the dot product is flexibly configurable; the computational performance is high, the overhead is small, the functionality is rich, the code is compact and the speed is fast; the critical-path delay of the floating-point vector dot-product accumulation is reduced, the resources consumed by the fixed-point vector dot-product accumulation are reduced, the complexity of software programming is reduced and the code density is improved.
Description of drawings
Fig. 1 is a structural schematic diagram of the vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to an embodiment of the invention;
Fig. 2 is a schematic diagram of four 8x8 multipliers composed into a 16x16 multiplier according to an embodiment of the invention;
Fig. 3 is a layout diagram of four 8x8 multipliers composed into a 16x16 multiplier according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the network that compares three exponents at a time to obtain the larger floating-point exponent, in the floating-point exponent and mantissa pre-processing part of the dot-product accumulation network according to an embodiment of the invention;
Fig. 5 is a schematic diagram of the cascaded exponent comparator network that obtains the maximum floating-point exponent, in the floating-point exponent and mantissa pre-processing part of the dot-product accumulation network according to an embodiment of the invention;
Fig. 6 is a schematic diagram of the floating-point mantissa processing in the floating-point exponent and mantissa pre-processing part of the dot-product accumulation network according to an embodiment of the invention;
Fig. 7 is a schematic diagram of the 8/16/32-bit fixed-point and 32-bit floating-point reconfigurable compressor network in the reconfigurable compressor part of the dot-product accumulation network according to an embodiment of the invention;
Fig. 8 is a schematic diagram of the 32-bit compressor in the reconfigurable compressor part of the dot-product accumulation network according to an embodiment of the invention;
Fig. 9 is a schematic diagram of the floating-point exponent and mantissa post-processing/fixed-point operation part and the parallel exponent correction unit of the dot-product accumulation network according to an embodiment of the invention.
Embodiment
To make the objects, technical scheme and advantages of the present invention clearer, the present invention is further explained below in conjunction with specific embodiments and with reference to the accompanying drawings.
The main features of the present invention are: the data format is reconfigurable and the vector length is configurable. The following notation is agreed for the description: the dot-product instruction is written D = A + B DOT C {(U)} {(M)} {(FBS)}, where A and D are 32-bit scalar data, B and C are 512-bit vector data and DOT denotes the dot-product operator; Mask is a 64-bit register, each bit of which controls one byte of the B*C result; M indicates that the dot-product accumulation is affected by the Mask register, and when the M option is absent the Mask register has no influence on the accumulation; U indicates the unsigned option; FBS indicates the data format, encoded with 2 binary bits: "00" denotes 32-bit fixed point, "01" denotes 32-bit simplified single-precision floating point, "10" denotes 8-bit bytes and "11" denotes 16-bit half-words.
In the embodiment of the invention it is assumed that B and C are 512-bit vector data, but the present invention is applicable to any B, C whose width is a multiple of 32 bits; the width of the Mask register and the length of the vectors B, C are related by Length_Mask = Length_B / 8.
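To make the instruction semantics above easier to check, the following is a minimal software reference model of D = A + B DOT C {(U)}{(M)}{(FBS)}, not a description of the patented hardware; the little-endian element packing and the rule that an element is enabled by the Mask bit of its first byte are assumptions made for illustration.

```python
import struct

def dot_accumulate(A, B_bytes, C_bytes, fbs, unsigned=False, mask=None):
    """Reference model of D = A + B DOT C for 512-bit (64-byte) vectors B, C."""
    width = {0b00: 4, 0b01: 4, 0b10: 1, 0b11: 2}[fbs]   # element size in bytes
    acc = A
    for i in range(0, 64, width):
        # Assumption: an element participates only if the Mask bit of its
        # first byte is set (the text states one Mask bit per byte of B*C).
        if mask is not None and not (mask >> i) & 1:
            continue
        if fbs == 0b01:                                  # single-precision floats
            b, = struct.unpack("<f", bytes(B_bytes[i:i + 4]))
            c, = struct.unpack("<f", bytes(C_bytes[i:i + 4]))
        else:                                            # 8/16/32-bit fixed point
            b = int.from_bytes(B_bytes[i:i + width], "little", signed=not unsigned)
            c = int.from_bytes(C_bytes[i:i + width], "little", signed=not unsigned)
        acc += b * c
    return acc

# 64 unsigned bytes, each product 1*2, accumulated onto A = 0:
B = bytes([1] * 64); C = bytes([2] * 64)
assert dot_accumulate(0, B, C, 0b10, unsigned=True) == 128
```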
As shown in Fig. 1, which is a structural schematic diagram of the vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to an embodiment of the invention, the dot-product accumulation network comprises, connected in sequence, the parallel reconfigurable multiplier part 1, the floating-point exponent and mantissa pre-processing part 2, the reconfigurable compressor part 3, and the floating-point exponent and mantissa post-processing/fixed-point operation part 4. Specifically:
The parallel reconfigurable multiplier 1 receives the vector data B, C and the data options FBS, U as input, performs the vector multiplication according to the corresponding data format to obtain the result B*C, and outputs it to the floating-point exponent and mantissa pre-processing part 2;
The floating-point exponent and mantissa pre-processing part 2 receives the product B*C of the parallel reconfigurable multiplier 1 and the scalar data A as input, performs operations such as maximum-exponent selection, exponent subtraction, alignment shifting, complement conversion and sticky-bit compensation, obtains the processed vector result B*C and scalar result A, and outputs this result to the reconfigurable compressor part 3;
The reconfigurable compressor part 3 receives the result of the floating-point exponent and mantissa pre-processing part 2, compresses it to obtain a sum string S and a carry string C, and outputs them to the floating-point exponent and mantissa post-processing/fixed-point operation part 4;
The floating-point exponent and mantissa post-processing/fixed-point operation part 4 receives the sum string S and the carry string C of the reconfigurable compressor part 3 and adds the mantissas. For fixed-point formats, the result is processed directly to obtain the fixed-point vector dot-product accumulation result; for floating-point formats, operations such as leading-1 detection, normalization shifting, normalization rounding, exponent adjustment and sign adjustment are performed to finally obtain the floating-point vector dot-product accumulation result.
The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length provided by the invention is described in detail below with reference to Fig. 2 to Fig. 9. The concrete realization of the invention involves parallel, cascaded, reconfigurable and configurable design aspects.
The parallel reconfigurable multiplier 1 uses 16 identical 32-bit reconfigurable multipliers 11; it supports 8/16/32-bit fixed-point multiplication and 32-bit simplified IEEE-754 standard single-precision floating-point multiplication, and the 16 32-bit multipliers work in parallel to achieve a 16x32-bit throughput. Each 32-bit reconfigurable multiplier is composed of basic 8x8 multipliers and can perform 8/16/32-bit signed/unsigned fixed-point multiplication and 32-bit simplified single-precision floating-point multiplication, producing a 512-bit (16x32-bit) multiplication result. A multiplier of larger width is composed from multipliers of smaller width: with the 8x8 multiplier as the basic cell, four 8x8 multipliers are composed into one 16x16 multiplier, and four 16x16 multipliers are composed into one 32x32 multiplier.
As shown in Fig. 2, which illustrates the composition of four 8x8 multipliers into a 16x16 multiplier according to an embodiment of the invention, the mathematical description of the 16-bit multiplier is given by formula 1:

A x B = (AH x 2^8 + AL) x (BH x 2^8 + BL)
      = AH x BH x 2^16 + (AH x BL + AL x BH) x 2^8 + AL x BL    (formula 1)

where AH/BH and AL/BL are respectively the high and low 8 bits of the 16-bit data A/B. First the four 8x8 multipliers work in parallel and each obtains a 2-row compression result through a Wallace compression tree; the compression results of the four multipliers are then aligned according to their 8-bit weights and enter an 8-2 compressor to obtain the final sum string S and carry string C; S and C pass through a 24-bit adder to obtain the high 24 bits of the 16x16 multiplier, while the low 8 bits of the 16x16 multiplier come directly from the low 8 bits of the AL x BL multiplier, and the low-byte carry output of the AL x BL multiplier is used as the carry input of the 24-bit adder. Finally, through a selector, when working in 8x8 mode the selector outputs the 16-bit results of the AL x BL and AH x BH multipliers; when working in 16x16 mode the selector outputs the 32-bit result of the 16x16 multiplier, formed by the result of the 24-bit adder and the low-8-bit result of AL x BL.
Fig. 3 shows the layout of four 8x8 multipliers composed into a 16x16 multiplier according to an embodiment of the invention; in the same way, four 16x16 multipliers can be composed into one 32x32 multiplier by the same method.
The above describes the idea of composing low-width multipliers into a higher-width multiplier. In the concrete realization of the multipliers of the present invention, however, the default saturating truncation mode is adopted: the result of an 8x8 multiplication keeps only the low 8 bits, and when the high bits of the result are non-zero the low byte is saturated directly to the maximum/minimum value. Likewise, for a 16x16 multiplication the result keeps the low 16 bits, and for a 32x32 multiplication the result keeps the low 32 bits.
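To make formula 1 concrete, the sketch below composes an unsigned 16x16 product from four 8x8 partial products. It is a plain software illustration of the decomposition only; it does not model the Wallace trees, the 8-2 compressor or the saturating truncation mode described above.

```python
def mul16_from_8x8(a: int, b: int) -> int:
    """Compose a 16x16 product from four 8x8 partial products (formula 1)."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 16)
    ah, al = a >> 8, a & 0xFF
    bh, bl = b >> 8, b & 0xFF
    # Four 8x8 partial products, each corresponding to one basic multiplier cell.
    pp_hh = ah * bh
    pp_hl = ah * bl
    pp_lh = al * bh
    pp_ll = al * bl
    # Align by weight and sum: in hardware this is the compressor/adder stage.
    return (pp_hh << 16) + ((pp_hl + pp_lh) << 8) + pp_ll

assert mul16_from_8x8(0xBEEF, 0xCAFE) == 0xBEEF * 0xCAFE
```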
As shown in Fig. 1, the floating-point exponent and mantissa pre-processing part 2 comprises an exponent cascade comparator 21, an exponent difference array 22, an alignment shift unit 23, a complement conversion unit 24 and a sticky-bit compensation unit 25. The exponent cascade comparator 21 obtains the maximum value E_max of the 17 floating-point exponents. The exponent difference array 22 obtains, for each floating-point exponent E_i, the difference E_max - E_i, which is used as the shift distance of the alignment shift unit 23. The alignment shift unit 23 uses the output E_max - E_i of the exponent difference array 22 as the control signal to perform a right shift that aligns the floating-point mantissas. The complement conversion unit 24 complements the shifted floating-point mantissas; the mantissas that must be negated are those whose sign bit differs from the sign bit of the operand with the maximum floating-point exponent. The sticky-bit compensation unit 25 performs one-bit compensation for the shifted-out mantissa bits and for the mantissas that need to be complemented, producing 17 compensation bits.
The floating-point exponent and mantissa pre-processing part 2 obtains the 16 floating-point multiplication results from the parallel reconfigurable multiplier part 1 and separates out the floating-point exponents E_0 to E_15, the mantissas M_0 to M_15 and the sign bits S_0 to S_15. The 16 exponents E_0 to E_15 of the multiplication results and the floating-point exponent E_16 in the scalar register enter the cascade exponent comparator 21 to obtain the maximum floating-point exponent E_max; the exponent difference array 22, which consists of 17 parallel 8-bit subtracters, then obtains for each floating-point exponent E_i the difference dE_i = E_max - E_i. The alignment shift unit 23 uses 17 parallel 32-bit shifters working simultaneously; the control signal of each shifter comes from the output dE_i of the exponent difference array 22, and the shifter output goes to the complement conversion unit 24. When S_i differs from S_max (S_i XOR S_max = 1) the correspondingly shifted floating-point mantissa must be complemented; at the same time sticky compensation is applied for the dE_i mantissa bits that are shifted out. When the mantissa needs complementing (S_i XOR S_max = 1) or the shifted-out binary bits contain a 1, sticky compensation is required, i.e. Sticky_i = 1. The sticky-bit compensation unit 25 counts the number Comp of Sticky_0 to Sticky_16 that require sticky compensation. The whole floating-point exponent and mantissa pre-processing part outputs the 512-bit vector mantissas, the 32-bit scalar mantissa and the 5-bit sticky compensation count Comp.
When operating in fixed-point mode, the floating-point exponent and mantissa pre-processing part 2 outputs the fixed-point result directly through a separate path; fixed-point data do not undergo any processing in this part.
As shown in Fig. 4, which illustrates the network in the floating-point exponent and mantissa pre-processing part of the dot-product accumulation network that compares three exponents at a time to obtain the larger floating-point exponent, according to an embodiment of the invention: to reduce the critical-path delay, the cascaded floating-point exponent comparison compares 3 numbers at a time; three 8-bit comparators work in parallel and each produces a corresponding flag bit, and the maximum of the 3 numbers is then selected for output according to the flag bits.
As shown in Fig. 5, which illustrates the cascaded exponent comparator network that obtains the maximum floating-point exponent in the floating-point exponent and mantissa pre-processing part of the dot-product accumulation network according to an embodiment of the invention: E_0 to E_16 enter the first-level comparator array and 6 "larger values" are obtained; these 6 "larger values" then enter the second- and third-level comparator arrays in turn, finally yielding the maximum floating-point exponent E_max. In this way the cascaded comparator is reduced from the original ceil(log2(17)) = 5 levels to 3 levels, removing the delay of two stages of 8-bit comparators; some corresponding control logic is added, but the control-logic delay is much smaller than that of an 8-bit comparator, so overall the delay of the cascaded exponent comparator is reduced.
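A behavioral sketch of the 3-at-a-time cascaded maximum over the 17 exponents is given below; the grouping is illustrative, and the three 8-bit comparisons of each group would be performed in parallel in hardware.

```python
def max3(a, b, c):
    # Three pairwise comparisons produce flags; the maximum is selected from them.
    ge_ab, ge_ac, ge_bc = a >= b, a >= c, b >= c
    if ge_ab and ge_ac:
        return a
    return b if ge_bc else c

def cascaded_max(exponents):
    """Reduce a list of exponents level by level, three values per comparator."""
    level = list(exponents)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 3):
            group = level[i:i + 3]
            nxt.append(max3(*group) if len(group) == 3 else max(group))
        level = nxt
    return level[0]

assert cascaded_max(range(17)) == 16   # 17 -> 6 -> 2 -> 1: three levels
```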
As shown in Fig. 6, which illustrates the floating-point mantissa processing in the floating-point exponent and mantissa pre-processing part of the dot-product accumulation network according to an embodiment of the invention: in order to preserve the precision of the floating-point dot product and satisfy the IEEE-754 standard, the floating-point mantissa is extended by 7 bits, i.e. the 24-bit mantissa is shifted left by 7 bits; at the same time, so that the floating-point mantissa can reuse the fixed-point mantissa compression path, a 0 is appended at the most significant bit of the floating-point mantissa, extending the 24-bit mantissa to 32 bits. During alignment the 32-bit mantissa is shifted right by dE_i and the dE_i bits that are shifted out are preserved. When S_i XOR S_max = 1 the 32-bit mantissa must be complemented. The sticky-bit compensation unit receives S_i XOR S_max and the dE_i shifted-out bits as control signals: when S_i XOR S_max = 1, or the OR-reduction of the dE_i shifted-out bits is 1, complement compensation is performed, i.e. Sticky_i = 1; otherwise Sticky_i = 0.
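The per-operand mantissa pre-processing described above can be summarized by the following sketch; the 32-bit field width follows the text, while the deferral of the complement's "+1" into the sticky flag is a reading of the stated rule (Sticky_i = 1 when the mantissa needs complementing or the shifted-out bits contain a 1), not a statement of the exact circuit. Names are illustrative.

```python
def preprocess_mantissa(mant24: int, sign: int, d_e: int, sign_max: int):
    """Extend, align, conditionally complement one mantissa; return (mantissa, sticky)."""
    m = (mant24 << 7) & 0x7FFFFFFF          # 7-bit extension, MSB kept 0 (32-bit field)
    shifted_out = m & ((1 << d_e) - 1)      # the dE_i bits that fall off the right end
    m >>= d_e                               # alignment right shift by dE_i
    needs_complement = sign ^ sign_max      # complement when sign differs from S_max
    if needs_complement:
        m = (~m) & 0xFFFFFFFF               # one's complement; the "+1" is deferred
    sticky = 1 if (needs_complement or shifted_out != 0) else 0
    return m, sticky

# Comp is the count of sticky flags over the 17 operands:
# comp = sum(sticky_i for sticky_i in stickies)
```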
As shown in Fig. 1, the reconfigurable compressor part 3 performs the 8/16/32-bit fixed-point and 32-bit floating-point mantissa compression and comprises a Mask screening unit 31, a sign-bit extension unit 32 and a reconfigurable compressor network 33. The Mask screening unit 31 receives the output of the floating-point exponent and mantissa pre-processing part and examines the value of the Mask register; the Mask register controls whether the corresponding parts of the vector register participate in the dot-product operation. When the M option is active, only the data whose corresponding Mask bit is 1 enter the compressor network; when the M option is inactive, the Mask register has no influence on the dot-product operation. The Mask register has 64 bits, each of which indicates one byte of the vector register. The value of the scalar register and the sticky compensation count Comp are not affected by the Mask register. In the sign-bit extension unit 32, the data of the 512-bit vector register that pass the Mask screening are treated byte by byte, each byte being an independent unit, and an 8-bit sign extension is performed for each: for the unsigned fixed-point dot-product operation (the U option is active) 8 zero bits are appended at the high end; for the signed fixed-point dot-product operation (the U option is inactive) and for the floating-point dot-product operation, 8 sign bits are appended at the high end. After sign extension, the vector mantissas, the scalar mantissa and the Comp compensation bits enter the reconfigurable compressor network 33 together. The reconfigurable compressor network 33 supports compression of 8/16/32-bit signed/unsigned data, compressing the 64/32/16 pieces of 8/16/32-bit data, the 32-bit scalar data and the Comp compensation bits into a sum string S and a carry string C.
As shown in Fig. 7, which illustrates the 8/16/32-bit fixed-point and 32-bit floating-point reconfigurable compressor network in the reconfigurable compressor part of the dot-product accumulation network according to an embodiment of the invention: the 512-bit vector mantissas, after sign extension, enter the first-level compressor array, and after three layers of 32-bit compressors a 2-row compression result is obtained. When operating in floating-point mode, the 2 outputs of the third-layer 32-bit compressor, the value of the scalar register and the sticky compensation count Comp enter a 4-2 compressor to obtain the floating-point mantissa compression result; when operating in 32-bit fixed-point mode, the Comp input of the 4-2 compressor is 0 and the 32-bit fixed-point compression result is obtained. For the 16-bit fixed-point format, after the three layers of 32-bit compressors the data are further compressed, together with the value of the scalar register, by a 16-bit compressor and a 3-2 compressor in turn to obtain the 16-bit fixed-point compression result. The 8-bit fixed-point format is handled similarly to the 16-bit fixed-point format.
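The essence of the compressor network is carry-save reduction: each compressor stage replaces several operands by a sum string and a carry string whose total is unchanged. The sketch below shows only this principle, using 3-2 compressors; the actual layer structure (three layers of 32-bit compressors followed by the final 4-2 stage taking the scalar and Comp) and the exact word widths follow the description above and are not reproduced here.

```python
MASK32 = 0xFFFFFFFF

def csa_3_2(x, y, z):
    """3-2 carry-save compressor: x + y + z == s + c (modulo 2^32)."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s & MASK32, c & MASK32

def compress_to_sc(operands):
    """Reduce a list of operands to a (S, C) pair, sum preserved modulo 2^32."""
    ops = [v & MASK32 for v in operands]
    while len(ops) > 2:
        s, c = csa_3_2(ops.pop(), ops.pop(), ops.pop())
        ops += [s, c]
    return ops[0], ops[1]

vals = list(range(17)) + [3]                 # e.g. 17 mantissas plus Comp
S, C = compress_to_sc(vals)
assert (S + C) & MASK32 == sum(vals) & MASK32
```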
As shown in Fig. 8, which illustrates the 32-bit compressor in the reconfigurable compressor part of the dot-product accumulation network according to an embodiment of the invention: the upper part of the figure is the compressor proper and the lower part is the sign-bit extension. A 32-bit compressor is composed of four 8-bit compressors connected in series through multiplexers (MUX); according to the data format, each MUX selects either the carry of the lower-order compressor or 0 as the input to the higher-order compressor. When operating in 8-bit mode, the three MUXes each select the 0 input; when operating in 16-bit mode, the 1st and 3rd MUXes select the low-order carry and the 2nd MUX selects 0; when operating in 32-bit mode, all three MUXes select the low-order carry. The four 8-bit compressors of the sign-bit extension are connected to the four 8-bit compressors of the compressor part respectively and receive the extended sign bits as input, which guarantees arithmetic equivalence when compressing data of different granularities.
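The role of the three MUX-controlled carries can be illustrated with a plain segmented addition; the real unit is a carry-save compressor, so the sketch below only shows how the carry selection partitions one 32-bit datapath into four 8-bit lanes, two 16-bit lanes or one 32-bit lane.

```python
def segmented_add32(a: int, b: int, mode: int) -> int:
    """mode is the element width in bits: 8, 16 or 32."""
    # MUX i sits between byte i and byte i+1; True means "propagate the carry".
    mux = {8: (False, False, False),
           16: (True, False, True),
           32: (True, True, True)}[mode]
    result, carry = 0, 0
    for i in range(4):                                    # four 8-bit segments
        s = ((a >> (8 * i)) & 0xFF) + ((b >> (8 * i)) & 0xFF) + carry
        result |= (s & 0xFF) << (8 * i)
        carry = (s >> 8) if i < 3 and mux[i] else 0       # MUX selects carry or 0
    return result

assert segmented_add32(0x0000FFFF, 0x00000001, 16) == 0x00000000  # two 16-bit lanes
assert segmented_add32(0x0000FFFF, 0x00000001, 32) == 0x00010000  # one 32-bit lane
```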
As shown in Fig. 1, the floating-point exponent and mantissa post-processing/fixed-point operation part 4 comprises a mantissa addition unit 41, a leading-zero prediction unit (PZD) 42, a floating-point mantissa normalization shift unit 43, a floating-point mantissa normalization rounding unit 44, an exponent correction unit 45, a sign-bit correction unit 46 and a fixed-point result processing unit 47. The mantissa addition unit 41 adds the compressed S and C strings to obtain the mantissa addition result. The leading-zero prediction unit 42 pre-encodes the S and C strings that enter the mantissa addition unit 41 into a 0/1 string; this 0/1 string is processed by the leading-zero detection circuit to obtain the position of the leading 1, which controls the distance by which the mantissa computation result is shifted during normalization; because the pre-encoding may produce a 1-bit error, the shifted result also passes through a compensation circuit that detects and corrects the error. The floating-point mantissa normalization shift unit 43 shifts the resulting mantissa by the shift distance obtained from the leading-zero prediction unit 42 to obtain the normalized floating-point mantissa. The floating-point mantissa normalization rounding unit 44 rounds the normalized mantissa according to the Guard, Round and Sticky bits. The exponent correction unit 45 and the sign correction unit 46 perform the exponent adjustment and sign-bit correction according to the output of the PZD, the outcome of the normalization rounding and the mantissa addition result, obtaining the final floating-point exponent and sign bit. The fixed-point result processing unit 47 processes the result of the mantissa addition unit 41 according to the fixed-point instruction options and obtains the fixed-point vector dot-product accumulation result.
The floating-point exponent and mantissa post-processing/fixed-point operation part 4 uses a fast compound adder to compute both S + C and its negation -(S + C). When operating in a fixed-point format, the fixed-point result processing unit 47 selects the S + C result and post-processes it to obtain the fixed-point vector dot-product accumulation result. When operating in the floating-point format, either S + C or -(S + C) is selected according to the sign bit of the S + C result, and the sign-bit correction unit 46 completes the correction of the floating-point sign bit: when the sign bit of S + C is negative, -(S + C) is selected and the negation of S_max is taken as the sign bit of the final result; otherwise the value of S + C is selected and the sign bit of the final result is the same as S_max. The leading-zero prediction unit 42 pre-encodes the S and C strings entering the mantissa addition unit 41 into a binary string; this binary string is processed by the leading-zero detection circuit to obtain the position of the leading 1 and the normalization shift distance D_normal, which controls the number of bit positions by which the floating-point mantissa normalization shift unit 43 shifts left or right for normalization; because the pre-encoding may produce a one-bit error, the shifted result also passes through a compensation circuit that detects and corrects the error. After the normalization shift is complete, the floating-point mantissa normalization rounding unit 44 evaluates the Guard/Round/Sticky bits of the shift result to complete the rounding, and determines whether the mantissa result requires a second rounding (SecondRound); when the rounding produces a carry that overflows the most significant bit of the floating-point mantissa, a second rounding is needed, and the second rounding affects the floating-point result exponent. The exponent correction unit 45 completes the adjustment of the floating-point result exponent according to the maximum floating-point exponent E_max, the normalization shift distance D_normal and whether a second rounding (SecondRound) occurs.
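A compact sketch of the compound-adder selection described above follows; the 37-bit working width (32 data bits plus growth for 17 operands) and the two's-complement framing are assumptions made for illustration, not figures from the text.

```python
def mantissa_sum_and_sign(S: int, C: int, s_max: int, width: int = 37):
    """Select |S+C| and the final sign bit from the compound adder outputs."""
    full = (1 << width) - 1
    total = (S + C) & full
    negative = bool(total >> (width - 1))            # sign bit of S + C
    magnitude = ((~total + 1) & full) if negative else total   # -(S + C) when negative
    sign = (1 - s_max) if negative else s_max        # negate S_max when S + C < 0
    return magnitude, sign
```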
As shown in Fig. 9, which illustrates the parallel exponent correction unit in the floating-point exponent and mantissa post-processing/fixed-point operation part of the dot-product accumulation network according to an embodiment of the invention: it works in parallel with the floating-point mantissa normalization shift unit 43. Two 8-bit adders are used to compute E_max + D_normal and E_max + D_normal + 1 respectively, and the final exponent result is then selected by the second-rounding (SecondRound) control logic. When a second rounding is needed, the mantissa must also be shifted left by one bit after the first rounding is complete, and the exponent of the floating-point result is E_max + D_normal + 1. By adjusting the exponent in parallel, the exponent-adjustment path, which would otherwise contain two 8-bit adders in series, now contains only one adder followed by a selection, which reduces the critical-path delay and improves the performance of the whole system.
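The speculative exponent adjustment of Fig. 9 reduces, in effect, to computing both candidate exponents up front and letting the SecondRound flag select one, as in the following sketch.

```python
def adjust_exponent(e_max: int, d_normal: int, second_round: bool) -> int:
    candidate0 = e_max + d_normal          # adder 1: no carry out of rounding
    candidate1 = e_max + d_normal + 1      # adder 2, computed in parallel
    return candidate1 if second_round else candidate0   # mux on SecondRound
```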
Based on the vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length shown in Fig. 1 to Fig. 9, the present invention also provides a data summation method with reconfigurable fixed/floating point and configurable data length. The method comprises: 8/16/32-bit fixed-point data are reconfigurable, with 8-bit fixed-point data as the basic unit; two 8-bit fixed-point data and the corresponding control logic are reconfigured into 16-bit fixed-point data, and four 8-bit fixed-point data and the corresponding control logic are reconfigured into 32-bit fixed-point data. Fixed and floating point are reconfigurable: after the floating-point mantissa is shifted for alignment, whether to complement it is decided according to the sign bit; when the sign bit is 1 the floating-point mantissa is negated, the sign bit is kept at 1, and the floating-point sign bit and mantissa form new 32-bit data; when the sign bit is 0 the floating-point mantissa is left unchanged. After the sign-bit handling, the floating-point mantissa can reuse the fixed-point datapath. The data length is configurable and is realized through the Mask register: each bit of the Mask register controls a particular bit field of the vector data, and the data length participating in the computation is configured by setting the value of the Mask register.
The specific embodiments described above further explain the objects, technical scheme and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (13)

1. A vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length, characterized in that it comprises:
a parallel reconfigurable multiplier (1), configured to receive the vector data B, C and the data options FBS, U as input, perform the vector multiplication to obtain the product B*C of the vector data B and C, and output it to the floating-point exponent and mantissa pre-processing part (2);
the floating-point exponent and mantissa pre-processing part (2), configured to receive the product B*C of the parallel reconfigurable multiplier (1) and the scalar data A as input, perform maximum-exponent selection, exponent subtraction, alignment shifting, complement conversion and sticky-bit compensation, obtain the processed vector result B*C and scalar result A, and output them to the reconfigurable compressor part (3);
the reconfigurable compressor part (3), configured to receive the result of the floating-point exponent and mantissa pre-processing part (2), compress it to obtain a sum string S and a carry string C, and output them to the floating-point exponent and mantissa post-processing/fixed-point operation part (4); and
the floating-point exponent and mantissa post-processing/fixed-point operation part (4), configured to receive the sum string S and the carry string C from the reconfigurable compressor part (3), perform the mantissa addition, and post-process the mantissa addition result to obtain the final dot-product accumulation result.
2. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 1, characterized in that the parallel reconfigurable multiplier (1) is composed of 16 identical 32-bit reconfigurable multipliers (11), supports 8/16/32-bit fixed-point multiplication and 32-bit simplified IEEE-754 standard single-precision floating-point multiplication, and the 16 32-bit multipliers work in parallel to obtain a 16x32 = 512-bit multiplication result, achieving a 16x32-bit throughput.
3. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 2, characterized in that the parallel reconfigurable multiplier (1) is composed of 16 identical 32-bit reconfigurable multipliers (11), and the specific composition process comprises:
AH/BH and AL/BL are respectively the high and low 8 bits of the 16-bit data A/B; first the four 8x8 multipliers work in parallel and each obtains a 2-row compression result through a Wallace compression tree; the compression results of the four multipliers are then aligned according to their 8-bit weights and enter an 8-2 compressor to obtain the final sum string S and carry string C; S and C pass through a 24-bit adder to obtain the high 24 bits of the 16x16 multiplier, while the low 8 bits of the 16x16 multiplier come directly from the low 8 bits of the AL x BL multiplier, and the low-byte carry output of the AL x BL multiplier is used as the carry input of the 24-bit adder; finally, through a selector, when working in 8x8 mode the selector outputs the 16-bit results of the AL x BL and AH x BH multipliers, and when working in 16x16 mode the selector outputs the 32-bit result of the 16x16 multiplier formed by the result of the 24-bit adder and the low-8-bit result of AL x BL.
4. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 1, characterized in that the floating-point exponent and mantissa pre-processing part (2) comprises an exponent cascade comparator (21), an exponent difference array (22), an alignment shift unit (23), a complement conversion unit (24) and a sticky-bit compensation unit (25), wherein:
the exponent cascade comparator (21) is configured to obtain the maximum value E_max of the 17 floating-point exponents;
the exponent difference array (22) is configured to obtain, for each floating-point exponent E_i, the difference E_max - E_i, which is used as the shift distance of the alignment shift unit (23);
the alignment shift unit (23) uses the output E_max - E_i of the exponent difference array (22) as the control signal to perform a right shift that aligns the floating-point mantissas;
the complement conversion unit (24) is configured to complement the shifted floating-point mantissas; the mantissas that must be negated are those whose sign bit S_i differs from the sign bit S_max of the operand with the maximum floating-point exponent;
the sticky-bit compensation unit (25) is configured to perform one-bit compensation for the shifted-out mantissa bits and for the mantissas that need to be complemented, producing 17 compensation bits.
5. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 4, characterized in that the floating-point exponent and mantissa pre-processing part (2) obtains the 16 floating-point multiplication results from the parallel reconfigurable multiplier part (1) and separates out the floating-point exponents E_0 to E_15, the mantissas M_0 to M_15 and the sign bits S_0 to S_15; the 16 exponents E_0 to E_15 of the multiplication results and the floating-point exponent E_16 in the scalar register enter the cascade exponent comparator (21) to obtain the maximum floating-point exponent E_max; the exponent difference array (22), which consists of 17 parallel 8-bit subtracters, then obtains for each floating-point exponent E_i the difference dE_i = E_max - E_i; the alignment shift unit (23) uses 17 parallel 32-bit shifters working simultaneously, the control signal of each shifter comes from the output dE_i of the exponent difference array (22), and the shifter output goes to the complement conversion unit (24); when the mantissa needs complementing, i.e. S_i XOR S_max = 1, the shifted floating-point mantissa is complemented; at the same time sticky compensation is applied for the dE_i mantissa bits that are shifted out: when the mantissa needs complementing (S_i XOR S_max = 1) or the shifted-out binary bits contain a 1, sticky compensation is required, i.e. Sticky_i = 1; the sticky-bit compensation unit (25) counts the number Comp of Sticky_0 to Sticky_16 that require sticky compensation.
6. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 1, characterized in that the reconfigurable compressor part (3) comprises a Mask screening unit (31), a sign-bit extension unit (32) and a reconfigurable compressor network (33), wherein: the Mask screening unit (31) is configured to receive the output of the floating-point exponent and mantissa pre-processing part (2) and to examine the value of the Mask register, the Mask register controlling the number of vector-register elements participating in the dot-product accumulation; the data screened by the Mask register enter the sign-bit extension unit (32), in which each byte of the 512-bit vector register is an independent unit and an 8-bit sign extension is performed for each byte; after sign extension, the vector data, the scalar data and the Comp compensation bits enter the reconfigurable compressor network (33) together; the reconfigurable compressor network (33) supports compression of 8/16/32-bit signed/unsigned data, compressing the 64/32/16 pieces of 8/16/32-bit data with their sign-extension bits, the 32-bit scalar data with its sign-extension bits and the Comp compensation bits into a sum string S and a carry string C.
7. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 1, characterized in that the floating-point exponent and mantissa post-processing/fixed-point operation part (4) adds the sum string S and the carry string C received from the reconfigurable compressor part (3); for fixed-point formats, the result is processed directly to obtain the fixed-point vector dot-product accumulation result; for the floating-point format, leading-1 detection, normalization shifting, normalization rounding, exponent adjustment and sign adjustment operations are performed to finally obtain the floating-point vector dot-product accumulation result.
8. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 7, characterized in that the floating-point exponent and mantissa post-processing/fixed-point operation part (4) comprises a mantissa addition unit (41), a leading-zero prediction unit PZD (42), a floating-point mantissa normalization shift unit (43), a floating-point mantissa normalization rounding unit (44), an exponent correction unit (45), a sign-bit correction unit (46) and a fixed-point result processing unit (47), wherein:
the mantissa addition unit (41) is configured to add the compressed S and C strings to obtain the mantissa addition result;
the leading-zero prediction unit PZD (42) is configured to pre-encode the S and C strings entering the mantissa addition unit (41) into a 0/1 string; this 0/1 string is processed by the leading-zero detection circuit to obtain the position of the leading 1, which controls the distance by which the mantissa computation result is shifted during normalization; because the pre-encoding may produce a 1-bit error, the shifted result also passes through a compensation circuit that detects and corrects the error;
the floating-point mantissa normalization shift unit (43) is configured to shift the resulting mantissa by the shift distance obtained from the leading-zero prediction unit PZD (42) to obtain the normalized floating-point mantissa;
the floating-point mantissa normalization rounding unit (44) is configured to round the normalized mantissa according to the Guard, Round and Sticky bits;
the exponent correction unit (45) and the sign correction unit (46) are configured to perform the exponent adjustment and sign-bit correction according to the output of the PZD, the outcome of the normalization rounding and the mantissa addition result, obtaining the final floating-point exponent and sign bit;
the fixed-point result processing unit (47) is configured to process the result of the mantissa addition unit (41) according to the fixed-point instruction options, obtaining the fixed-point vector dot-product accumulation result.
9. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 8, characterized in that the floating-point exponent and mantissa post-processing/fixed-point operation part (4) uses a fast compound adder to compute both S + C and its negation -(S + C), wherein: when operating in a fixed-point format, the fixed-point result processing unit (47) selects the S + C result and post-processes it to obtain the fixed-point vector dot-product accumulation result; when operating in the floating-point format, either S + C or -(S + C) is selected according to the sign bit of the S + C result, and the sign-bit correction unit (46) completes the floating-point sign-bit correction; when the sign bit of S + C is negative, -(S + C) is selected and the negation of S_max is taken as the sign bit of the final result; otherwise the value of S + C is selected and the sign bit of the final result is identical to S_max.
10. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 8, characterized in that the leading-zero prediction unit PZD (42) pre-encodes the S and C strings entering the mantissa addition unit (41) into a binary string, and this binary string is processed by the leading-zero detection circuit to obtain the position of the leading 1 and the normalization shift distance D_normal, which controls the number of bit positions by which the floating-point mantissa normalization shift unit (43) shifts left or right for normalization.
11. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 8, characterized in that, after the normalization shift is complete, the floating-point mantissa normalization rounding unit (44) evaluates the Guard/Round/Sticky bits of the shift result to complete the rounding, and determines whether the mantissa result requires a second rounding (SecondRound); when the rounding produces a carry that overflows the most significant bit of the floating-point mantissa, a second rounding is needed, and the second rounding affects the floating-point result exponent.
12. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 8, characterized in that the exponent correction unit (45) completes the adjustment of the floating-point result exponent according to the maximum floating-point exponent E_max, the normalization shift distance D_normal and whether a second rounding (SecondRound) occurs.
13. The vector dot-product accumulation network supporting reconfigurable fixed/floating point and a configurable vector length according to claim 12, characterized in that the exponent correction unit (45) uses two 8-bit adders to compute E_max + D_normal and E_max + D_normal + 1 respectively, and then selects the final exponent result through the second-rounding (SecondRound) control logic; when a second rounding is needed, the mantissa must also be shifted left by one bit after the first rounding is complete, and the exponent of the floating-point result is E_max + D_normal + 1.
CN2011104130015A 2011-12-13 2011-12-13 Vector dot product accumulating network supporting reconfigurable fixed floating point and configurable vector length Pending CN102520906A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104130015A CN102520906A (en) 2011-12-13 2011-12-13 Vector dot product accumulating network supporting reconfigurable fixed floating point and configurable vector length

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104130015A CN102520906A (en) 2011-12-13 2011-12-13 Vector dot product accumulating network supporting reconfigurable fixed floating point and configurable vector length

Publications (1)

Publication Number Publication Date
CN102520906A true CN102520906A (en) 2012-06-27

Family

ID=46291849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104130015A Pending CN102520906A (en) 2011-12-13 2011-12-13 Vector dot product accumulating network supporting reconfigurable fixed floating point and configurable vector length

Country Status (1)

Country Link
CN (1) CN102520906A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104520807A (en) * 2012-08-30 2015-04-15 高通股份有限公司 Microarchitecture for floating point fused multiply-add with exponent scaling
CN105354803A (en) * 2015-10-23 2016-02-24 中国科学院上海高等研究院 Truncated histogram equilibrium implementation apparatus and method
CN106415483A (en) * 2014-03-06 2017-02-15 甲骨文国际公司 Floating point unit with support for variable length numbers
CN106951211A (en) * 2017-03-27 2017-07-14 南京大学 A kind of restructural fixed and floating general purpose multipliers
CN107305485A (en) * 2016-04-25 2017-10-31 北京中科寒武纪科技有限公司 It is a kind of to be used to perform the device and method that multiple floating numbers are added
CN107636640A (en) * 2016-01-30 2018-01-26 慧与发展有限责任合伙企业 Dot product engine with designator of negating
CN108345935A (en) * 2017-01-25 2018-07-31 株式会社东芝 Product and arithmetic unit, network element and network equipment
CN108694037A (en) * 2017-03-30 2018-10-23 Arm有限公司 Device and method for estimating shift amount when executing floating-point subtraction
CN109062607A (en) * 2017-10-30 2018-12-21 上海寒武纪信息科技有限公司 Machine learning processor and the method for executing the instruction of vector minimum value using processor
CN112019220A (en) * 2020-08-21 2020-12-01 广东省新一代通信与网络创新研究院 Block floating point data compression method and device based on differential bias detection
CN112119407A (en) * 2018-06-27 2020-12-22 国际商业机器公司 Low precision deep neural network enabled by compensation instructions
CN112130805A (en) * 2020-09-22 2020-12-25 腾讯科技(深圳)有限公司 Chip comprising floating-point adder, equipment and control method of floating-point operation
CN112148249A (en) * 2020-09-18 2020-12-29 北京百度网讯科技有限公司 Dot product operation implementation method and device, electronic equipment and storage medium
CN112740171A (en) * 2018-09-19 2021-04-30 赛灵思公司 Multiply and accumulate circuit
CN112988112A (en) * 2021-04-27 2021-06-18 北京壁仞科技开发有限公司 Dot product calculating device
CN113961875A (en) * 2017-05-08 2022-01-21 辉达公司 Generalized acceleration of matrix multiply-accumulate operations
CN116127255A (en) * 2022-12-14 2023-05-16 北京登临科技有限公司 Convolution operation circuit and related circuit or device with same
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1633637A (en) * 2001-10-05 2005-06-29 英特尔公司 Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions
US20080071851A1 (en) * 2006-09-20 2008-03-20 Ronen Zohar Instruction and logic for performing a dot-product operation
CN101840324A (en) * 2010-04-28 2010-09-22 中国科学院自动化研究所 64-bit fixed and floating point multiplier unit supporting complex operation and subword parallelism
CN101847087A (en) * 2010-04-28 2010-09-29 中国科学院自动化研究所 Reconfigurable transverse summing network structure for supporting fixed and floating points

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1633637A (en) * 2001-10-05 2005-06-29 英特尔公司 Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions
US20080071851A1 (en) * 2006-09-20 2008-03-20 Ronen Zohar Instruction and logic for performing a dot-product operation
CN101840324A (en) * 2010-04-28 2010-09-22 Institute of Automation, Chinese Academy of Sciences 64-bit fixed and floating point multiplier unit supporting complex operation and subword parallelism
CN101847087A (en) * 2010-04-28 2010-09-29 Institute of Automation, Chinese Academy of Sciences Reconfigurable transverse summing network structure for supporting fixed and floating points

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
顾荣荣 (Gu Rongrong): "高性能可重构乘加单元设计" [Design of a high-performance reconfigurable multiply-add unit], 《大众科技》 [Popular Science & Technology], vol. 2010, no. 02, 1 March 2010 (2010-03-01) *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9841948B2 (en) 2012-08-30 2017-12-12 Qualcomm Incorporated Microarchitecture for floating point fused multiply-add with exponent scaling
CN104520807A (en) * 2012-08-30 2015-04-15 Qualcomm Incorporated Microarchitecture for floating point fused multiply-add with exponent scaling
CN104520807B (en) * 2012-08-30 2016-09-21 Qualcomm Incorporated Microarchitecture for floating point fused multiply-add with exponent scaling
CN106415483A (en) * 2014-03-06 2017-02-15 Oracle International Corporation Floating point unit with support for variable length numbers
CN106415483B (en) * 2014-03-06 2020-12-29 Oracle International Corporation Floating point unit with support for variable length numbers
CN105354803A (en) * 2015-10-23 2016-02-24 Shanghai Advanced Research Institute, Chinese Academy of Sciences Truncated histogram equalization implementation apparatus and method
CN107636640A (en) * 2016-01-30 2018-01-26 Hewlett Packard Enterprise Development LP Dot product engine with negation indicator
CN107305485A (en) * 2016-04-25 2017-10-31 Beijing Zhongke Cambricon Technology Co., Ltd. Device and method for performing addition of multiple floating-point numbers
CN107305485B (en) * 2016-04-25 2021-06-08 Cambricon Technologies Corporation Limited Device and method for performing addition of multiple floating point numbers
CN108345935A (en) * 2017-01-25 2018-07-31 Kabushiki Kaisha Toshiba Product-sum arithmetic device, network unit and network device
CN106951211A (en) * 2017-03-27 2017-07-14 Nanjing University Reconfigurable fixed- and floating-point general-purpose multiplier
CN106951211B (en) * 2017-03-27 2019-10-18 Nanjing University Reconfigurable fixed- and floating-point general-purpose multiplier
CN108694037A (en) * 2017-03-30 2018-10-23 Arm Limited Device and method for estimating shift amount when executing floating-point subtraction
CN108694037B (en) * 2017-03-30 2023-12-01 Arm Limited Apparatus and method for estimating shift amount when floating point subtraction is performed
CN113961875A (en) * 2017-05-08 2022-01-21 Nvidia Corporation Generalized acceleration of matrix multiply-accumulate operations
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US11762631B2 (en) 2017-10-30 2023-09-19 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US12050887B2 (en) 2017-10-30 2024-07-30 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109062607A (en) * 2017-10-30 2018-12-21 Shanghai Cambricon Information Technology Co., Ltd. Machine learning processor and method for executing a vector minimum instruction using the processor
CN109062607B (en) * 2017-10-30 2021-09-21 Shanghai Cambricon Information Technology Co., Ltd. Machine learning processor and method for executing vector minimum instruction using the processor
US12056594B2 (en) 2018-06-27 2024-08-06 International Business Machines Corporation Low precision deep neural network enabled by compensation instructions
CN112119407B (en) * 2018-06-27 2024-07-09 International Business Machines Corporation Low precision deep neural network enabled by compensation instructions
CN112119407A (en) * 2018-06-27 2020-12-22 International Business Machines Corporation Low precision deep neural network enabled by compensation instructions
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US12094456B2 (en) 2018-09-13 2024-09-17 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and system
US12057110B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Voice recognition based on neural networks
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
US12057109B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN112740171A (en) * 2018-09-19 2021-04-30 Xilinx, Inc. Multiply and accumulate circuit
CN112740171B (en) * 2018-09-19 2024-03-08 Xilinx, Inc. Multiply and accumulate circuit
CN112019220A (en) * 2020-08-21 2020-12-01 Guangdong New Generation Communication and Network Innovation Institute Block floating point data compression method and device based on differential bias detection
JP2022552046A (en) * 2020-09-18 2022-12-15 Beijing Baidu Netcom Science Technology Co., Ltd. Dot product operation implementation method, device, electronic device, and storage medium
CN112148249B (en) * 2020-09-18 2023-08-18 Beijing Baidu Netcom Science and Technology Co., Ltd. Dot product operation implementation method and device, electronic equipment and storage medium
WO2022057502A1 (en) * 2020-09-18 2022-03-24 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and device for implementing dot product operation, electronic device, and storage medium
CN112148249A (en) * 2020-09-18 2020-12-29 Beijing Baidu Netcom Science and Technology Co., Ltd. Dot product operation implementation method and device, electronic equipment and storage medium
CN112130805A (en) * 2020-09-22 2020-12-25 Tencent Technology (Shenzhen) Co., Ltd. Chip comprising floating-point adder, device and control method of floating-point operation
CN112130805B (en) * 2020-09-22 2024-05-24 Tencent Technology (Shenzhen) Co., Ltd. Chip comprising floating-point adder, device and control method of floating-point operation
CN112988112A (en) * 2021-04-27 2021-06-18 Beijing Biren Technology Development Co., Ltd. Dot product calculating device
CN112988112B (en) * 2021-04-27 2021-08-10 Beijing Biren Technology Development Co., Ltd. Dot product calculating device
CN116127255B (en) * 2022-12-14 2023-10-03 Beijing Denglin Technology Co., Ltd. Convolution operation circuit and related circuit or device with same
CN116127255A (en) * 2022-12-14 2023-05-16 Beijing Denglin Technology Co., Ltd. Convolution operation circuit and related circuit or device with same

Similar Documents

Publication Publication Date Title
CN102520906A (en) Vector dot product accumulating network supporting reconfigurable fixed floating point and configurable vector length
CN101847087B (en) Reconfigurable transverse summing network structure for supporting fixed and floating points
CN107168678B (en) Multiply-add computing device and floating-point multiply-add computing method
CN104246690B (en) System and method for signal processing in a digital signal processor
CN102103479B (en) Floating point calculator and processing method for floating point calculation
CN106897046B (en) Fixed-point multiply-accumulator
US8892619B2 (en) Floating-point multiply-add unit using cascade design
CN110221808A (en) Preprocessing method for vector multiply-add operation, multiply-adder, and computer-readable medium
CN108287681A (en) Single-precision floating-point fused dot-product operation unit
WO2022170809A1 (en) Reconfigurable floating point multiply-accumulate operation unit and method suitable for multi-precision calculation
CN108255777B (en) Embedded floating point type DSP hard core structure for FPGA
CN116400883A (en) Floating point multiply-add device capable of switching precision
CN100555212C (en) Carry correction device for a floating-point dual MAC and its multiplication CSA compression tree
CN110688086A (en) Reconfigurable integer-floating point adder
CN112540743B (en) Reconfigurable processor-oriented signed multiply accumulator and method
CN116661734B (en) Low-precision multiply-add operator supporting multiple inputs and multiple formats
CN117111881B (en) Mixed precision multiply-add operator supporting multiple inputs and multiple formats
CN114418057A (en) Operation method of convolutional neural network and related equipment
CN116661733A (en) Multiplier and microprocessor supporting multiple precision
CN101840324B (en) 64-bit fixed and floating point multiplier unit supporting complex operation and subword parallelism
CN102004627B (en) Multiplication rounding implementation method and device
CN107092462B (en) 64-bit asynchronous multiplier based on FPGA
CN103955585B (en) FIR (finite impulse response) filter structure for low-power fault-tolerant circuit
RU185346U1 (en) Vector multiformat multiplier
CN116627379A (en) Reconfigurable method and system for supporting multi-precision floating point or fixed point operation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 2012-06-27