CN105335127A - Scalar operation unit structure supporting floating-point division method in GPDSP

Info

Publication number: CN105335127A
Application number: CN201510718454.7A
Authority: CN
Inventors: 彭元喜; 雷元武; 彭浩; 陈书明; 郭阳; 刘祥远; 田甜; 徐恩; 胡封林; 刘仲; 孙永节; 陈虎; 刘胜; 王耀华; 吴虎成
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2015-10-29
Filing date: 2015-10-29
Publication date: 2016-02-17

Abstract

The invention discloses a scalar operation unit structure supporting floating-point division in a GPDSP. The structure comprises a first component SMAC1, a second component SMAC2 and a third component SIEU, which serve as scalar operation components supporting basic scalar operations; each scalar operation component corresponds to one scalar instruction in a VLIW execute packet. The structure has the advantages of few instruction execution cycles, low latency, a simple structure and good feasibility.

Description

Scalar operation unit structure supporting floating-point division in a GPDSP

Technical field

The present invention relates mainly to the field of microprocessors, and in particular to an implementation structure for a scalar operation unit that supports floating-point division and is suitable for high-performance general-purpose DSP (GPDSP) chips.

Background technology

With the rapid development of digital services driven by the internet, mobile communication, consumer electronics and multimedia technology, more powerful digital signal processors are needed to handle huge data workloads, such as high-definition 2D or 3D digital image processing, radar signal processing, autonomous navigation information processing and mobile communication. These algorithms are all data-intensive and involve large numbers of floating-point, fixed-point, logic and complex-number basic operations as well as division. Division matters in particular: the performance of single-precision or double-precision floating-point division has a considerable impact on overall processor performance and becomes the bottleneck in some applications.

At present, no high-performance general-purpose DSP (GPDSP) directly supports floating-point division instructions. For example, TI's general-purpose floating-point DSP series cannot implement a floating-point division instruction directly: the hardware obtains an approximate reciprocal by table lookup, and a subroutine then performs the division by Newton iteration. This implementation has a small area, but the iterative method cannot produce a floating-point division result that conforms to the IEEE-754 standard, and its computation time is longer than that of a direct hardware implementation.

Because hardware division algorithms are complex, the resulting structures are complicated and occupy a large area, so a division module is generally not placed in the vector unit, where the degree of parallelism is high. The design of a scalar operation unit that supports floating-point division is therefore of great significance.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the invention provides a scalar operation unit structure supporting floating-point division in a GPDSP that has few instruction execution cycles, low latency, a simple structure and good feasibility.

To solve the above technical problem, the present invention adopts the following technical solution:

A scalar operation unit structure supporting floating-point division in a GPDSP comprises a first component SMAC1, a second component SMAC2 and a third component SIEU, which serve as scalar operation components supporting basic scalar operations; each scalar operation component corresponds to one scalar instruction in a VLIW execute packet.

As a further improvement of the present invention, the structure also comprises a scalar register file for reading and writing back data. When a scalar instruction issued by the dispatch unit is received, it is decoded to determine which scalar operation component it belongs to, and the corresponding source operand addresses and read requests are sent to the scalar register file at the same time. Once the instruction-valid signal reaches the scalar operation component, the component takes the data read from the scalar register file, starts the operation, and finally writes the result back to the scalar register file.

As a further improvement of the present invention, the first component SMAC1 and the second component SMAC2 are homogeneous MAC operation components. Each MAC operation component comprises a floating-point multiply-add unit FMAC, a fixed-point multiply-add unit IMAC, a floating-point arithmetic logic unit FALU and a floating-point division unit FDIV. These functional units are mutually independent units sharing the same data path; in any one cycle only one functional unit can execute a valid instruction, and after execution the result passes through the final-stage selection logic and is output to the corresponding destination address.

As a further improvement of the present invention, the floating-point multiply-add unit FMAC handles complex multi-cycle floating-point operations and adopts a dynamic pipeline structure: one instruction can be issued every cycle, and the pipeline stages can perform different operations in the same clock cycle.

As a further improvement of the present invention, the floating-point multiply-add unit FMAC adopts a structure in which the double-precision alignment shift and the single-precision alignment shift are separated. It comprises an operand preparation module R, a mantissa multiplication module X, a double-precision multiply-add path Y, a single-precision multiply-add path Z, and a normalization module S shared by the single-precision and double-precision paths. The operand preparation module R, according to the instruction, separates the sign, exponent and mantissa of single-precision and double-precision floating-point operands in accordance with the IEEE-754 standard and checks the input operands for exceptions. The mantissa multiplication module X computes the single-precision multiplication result mantissa for all instructions. The double-precision multiply-add path Y performs the exponent difference calculation for double-precision operations, the 161-bit alignment shift of the double-precision operand C, and the final-stage CSA 4:2 partial-product compression of the double-precision result mantissa. The single-precision multiply-add path Z performs the exponent difference calculation, the mantissa swap and the alignment after the mantissa swap for SIMD multiply-add, SIMD multiply-subtract, SIMD multiplication, dot product and complex multiplication. The shared normalization module S performs the result mantissa calculation after the alignment shift, the normalization, and the exponent correction.

As a further improvement of the present invention, the fixed-point multiply-add unit IMAC performs fixed-point multiply-accumulate. When fixed-point and floating-point multiply-add and multiply-subtract instructions are implemented, the two operands input to the multiplier are 64-bit floating-point data, the third operand is a 53-bit floating-point operand, and the result is a 64-bit floating-point operand. When fixed-point multiply-add and multiply-subtract instructions are executed, the two multiplier operands are 32-bit signed or unsigned operands, the third operand is a 64-bit signed or unsigned operand, and the result is a 64-bit signed or unsigned destination operand.

As a further improvement of the present invention, the floating-point arithmetic logic unit FALU comprises a floating-point FALU short-cycle instruction module, a floating-point ALU conversion instruction module and a floating-point ALU add/subtract instruction module. The short-cycle instruction module covers all single-cycle floating-point arithmetic and logic instructions: single/double-precision greater-than, less-than and equality comparison instructions; instructions that extract the exponent, mantissa and absolute value of single/double-precision values; instructions that compute single/double-precision reciprocals and reciprocal square roots; and the instruction that converts a single-precision floating-point number to a double-precision floating-point number.

As a further improvement of the present invention, the third component SIEU comprises a bit processing unit BP and a fixed-point arithmetic logic unit IALU, which are mutually independent units sharing the same data path; in any one cycle only one of them can execute a valid instruction. After execution, the result passes through the final-stage selection logic and is output to the corresponding destination address. In addition, depending on whether saturation occurs during execution, a set signal is generated to flag the conditionally executed instruction.

As a further improvement of the present invention, the BP unit comprises three functional units: a 64-bit shifter unit (shifter), a bit processing unit (Bitp) and a pack/unpack unit (PK). The decoded signals from the decode stage and the source operands from the registers are received by all three functional units, which start computing immediately; the final output is selected, according to a selection signal, from the shift, pack/unpack and bit processing results. If the instruction is a pack/unpack instruction that requires a saturation check, the saturation flag is output to the status register together with the result.

Compared with the prior art, the advantages of the present invention are:

1. The scalar operation unit structure supporting floating-point division in a GPDSP is a hardware implementation of floating-point division instructions based on the SRT-8 algorithm, and features few instruction execution cycles, low latency, a simple structure and good feasibility. In addition, the whole SPE structure allows logic reuse, which better serves design portability and area control.

2. The scalar operation unit structure supporting floating-point division in a GPDSP is a hybrid operation unit that supports 64-bit fixed-point, 32-bit fixed-point, double-precision floating-point and single-precision floating-point operations, giving it a complete set of functions.

3. The scalar operation unit structure supporting floating-point division in a GPDSP allows three pipelines to execute in parallel and is well suited to implementation in a processor. The floating-point part supports floating-point multiply-add, multiplication and addition, as well as complex-number, dot-product and related operations, and can meet the requirements of a variety of applications.

4. The scalar operation unit structure supporting floating-point division in a GPDSP reuses the key 64*64 multiplier, so that fixed-point and floating-point multiplication are implemented on the same hardware with a small area overhead. The invention supports a division design based on the SRT-8 algorithm that implements 64-bit signed/unsigned fixed-point integer division, 32-bit signed/unsigned fixed-point integer division, double-precision floating-point division and single-precision floating-point division instructions.

Brief description of the drawings

Fig. 1 is a schematic diagram of the position of the scalar operation unit (SPE) of the present invention within the processor.

Fig. 2 is a schematic diagram of the topology of the scalar operation unit (SPE) of the present invention.

Fig. 3 is a schematic diagram of the data path structure of the scalar operation unit (SPE) of the present invention.

Fig. 4 is a schematic diagram of the topology of the SMAC component in a specific application example.

Fig. 5 is a schematic diagram of the structure of the FMAC sub-unit of the SMAC component in a specific application example.

Fig. 6 is a schematic diagram of the structure of the IMAC sub-unit of the SMAC component in a specific application example.

Fig. 7 is a schematic diagram of the structure of the FALU sub-unit of the SMAC component in a specific application example.

Fig. 8 is a schematic diagram of the structure of the FDIV sub-unit of the SMAC component in a specific application example.

Fig. 9 is a schematic diagram of the structure of the 64*64 multiplier shared within the SMAC component in a specific application example.

Fig. 10 is a schematic diagram of the topology of the SIEU component in a specific application example.

Fig. 11 is a schematic diagram of the structure of the BP sub-unit of the SIEU component in a specific application example.

Fig. 12 is a schematic diagram of the structure of the IALU sub-unit of the SIEU component in a specific application example.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 shows the position of the scalar operation unit (SPE) of the present invention within the processor. The SPE is located in the scalar processing unit SC of the processor. It receives the scalar operation instructions issued by the dispatch unit in the instruction flow control unit, decodes them, and sends them to the corresponding functional unit in the SPE for execution. The scalar processing unit SC also contains a scalar data memory access unit, which controls the memory-access pipeline stages, such as decoding of scalar memory access instructions, address calculation and data write-back, and also supplies data to the SPE. The SPE in turn provides the scalar memory access unit with related operations such as address processing and data reading.

Fig. 2 shows the topology of the scalar operation unit SPE. The SPE integrates three operation components, the first component SMAC1, the second component SMAC2 and the third component SIEU, which support basic scalar operations. Each scalar operation component corresponds to one scalar instruction in a VLIW execute packet; that is, the SPE contains three pipelines that can execute in parallel. The SPE also contains a scalar register file for reading and writing back data. When the SPE receives a scalar instruction issued by the dispatch unit, the instruction is decoded to determine which operation component it belongs to, and the corresponding source operand addresses and read requests are sent to the scalar register file at the same time. Once the instruction-valid signal reaches the operation component, the component takes the data read from the scalar register file, starts the operation, and finally writes the result back to the scalar register file. The scalar register file of the processor is located in the SPE and provides independent read and write ports for the three functional components SMAC1, SMAC2 and SIEU, so that each component can read the number of operands its instructions require: SMAC1 and SMAC2 each have 3 read ports and 1 write port, and SIEU has 2 read ports and 1 write port.

As shown in Fig. 3 and Fig. 4, the first component SMAC1 and the second component SMAC2 are homogeneous MAC operation components. Each MAC operation component comprises four independent functional units: the floating-point multiply-add unit FMAC, the fixed-point multiply-add unit IMAC, the floating-point arithmetic logic unit FALU and the floating-point division unit FDIV. FMAC and IMAC share one 64*64 multiplier, which reduces the area of the whole processor. The four functional units are mutually independent units sharing the same data path; in any one cycle only one of them can execute a valid instruction, and after execution the result passes through the final-stage selection logic and is output to the corresponding destination address. In other words, the units cannot start execution or write back in the same cycle, but they can run in parallel through software pipeline scheduling. Their operands come mainly from the scalar register file and may also come from an immediate Imm. FALU has only two-operand instructions; the other three units all support three-operand instructions. The third component SIEU comprises the bit processing unit BP and the fixed-point arithmetic logic unit IALU, which are mutually independent units sharing the same data path. They cannot start execution or write back in the same cycle, but they can run in parallel through software pipeline scheduling. Their operands come mainly from the scalar register file and may also come from an immediate Imm, the scalar-vector shared register SVR or the vector unit local control register VULCR; both units have only two-operand instructions.

As shown in Fig. 5, the floating-point multiply-add unit FMAC is the functional unit among the processor's operation components that handles complex multi-cycle floating-point operations. It implements four classes of floating-point instructions: multiply instructions, multiply-add instructions, multiply-then-add instructions and add instructions. It adopts a dynamic pipeline structure: one instruction can be issued every cycle, and the pipeline stages can perform different operations in the same clock cycle.

In this embodiment, FMAC adopts a structure in which the double-precision alignment shift and the single-precision alignment shift are separated, that is, a path-separated FMAC structure. The aim is to simplify the hardware algorithm and to reduce the number of wide data registers between pipeline stages. The overall structure consists of five parts: the operand preparation module R, the mantissa multiplication module X, the double-precision multiply-add path Y, the single-precision multiply-add path Z, and the normalization module S shared by the single-precision and double-precision paths. The operand preparation module R, according to the instruction, separates the sign, exponent and mantissa of single-precision and double-precision floating-point operands in accordance with the IEEE-754 standard and checks the input operands for exceptions. The mantissa multiplication module X computes the single-precision multiplication result mantissa for all instructions. The double-precision path Y performs the exponent difference calculation for double-precision operations, the 161-bit alignment shift of the double-precision operand C, and the final-stage CSA 4:2 partial-product compression of the double-precision result mantissa. The single-precision path Z performs the exponent difference calculation, the mantissa swap and the alignment after the mantissa swap for SIMD multiply-add, SIMD multiply-subtract, SIMD multiplication, dot product and complex multiplication. The S module performs the result mantissa calculation after the alignment shift, the normalization, and the exponent correction.

As shown in Fig. 6, the fixed-point multiply-add unit IMAC is the functional unit that performs fixed-point multiply-accumulate. It executes fixed-point addition and subtraction, multiplication, multiply-add, multiply-subtract and MOV-class operations. Fixed-point algorithms and control flow contain many fixed-point add/subtract and MOV operations; integrating them into the fixed-point MAC unit increases the number of instruction slots available to fixed-point add/subtract and MOV instructions in a VLIW instruction and so raises computation speed. When fixed-point and floating-point multiply-add and multiply-subtract instructions are implemented, the two operands input to the multiplier are 64-bit floating-point data, the third operand is a 53-bit floating-point operand, and the result is a 64-bit floating-point operand. For fixed-point multiply-add and multiply-subtract instructions, the two multiplier operands are 32-bit signed or unsigned operands, the third operand is a 64-bit signed or unsigned operand, and the result is a 64-bit signed or unsigned destination operand. Related fixed-point dot-product and complex-number operations are also supported.

As shown in Fig. 7, the structure of the floating-point arithmetic logic unit FALU is divided, by function and instruction latency, into three execution sub-modules: the floating-point FALU short-cycle instruction module, the floating-point ALU conversion instruction module and the floating-point ALU add/subtract instruction module. The short-cycle instruction module covers all single-cycle floating-point arithmetic and logic instructions: single/double-precision greater-than, less-than and equality comparison instructions; instructions that extract the exponent, mantissa and absolute value of single/double-precision values; instructions that compute single/double-precision reciprocals and reciprocal square roots; and the instruction that converts a single-precision floating-point number to a double-precision floating-point number. The conversion instruction module implements the 2-cycle type conversion instructions between floating point and fixed point and between single and double precision: conversion of single/double-precision floating-point numbers to integers, truncating conversion of single/double-precision floating-point numbers to integers, conversion of (signed or unsigned) integers to single/double-precision floating-point numbers, and conversion of double-precision floating-point numbers to single precision. The add/subtract instruction module implements the 4-cycle single/double-precision floating-point add/subtract instructions.

Fig. 8 shows the floating-point division unit FDIV of this embodiment. For two operands A and B in IEEE-754 floating-point format, the division is carried out in the following steps:

S1. Detect whether the operands are exceptional data, and set the result data accordingly.

S2. Compute the sign bit of the result: XOR the two sign bits.

S3. Compute the exponent of the result: subtract the two exponents.

S4. Divide the mantissas: zeros are appended to the low-order end of the divisor mantissa, doubling its width, so as to extend the number of mantissa bits and obtain a result with the number of bits required by the precision.

S5. Normalize the result: after the mantissa division a left shift may be needed, with the exponent decreased accordingly, and the result mantissa is adjusted according to the rounding mode.

S6. Exception detection, covering overflow and underflow as defined in IEEE-754: if the result exponent exceeds the maximum exponent allowed by the precision, an overflow exception is returned; if the result exponent is smaller than the minimum exponent defined for the precision, an underflow exception is returned.
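Steps S1 to S6 can be illustrated in software. The following is a minimal Python sketch, not the patented hardware: the function name `fdiv_sketch` and its helpers are hypothetical, only normal double-precision operands are handled, and round-to-nearest-even is assumed as the rounding mode. For S4 the sketch extends the dividend mantissa with low-order zeros, which is an assumption about how the extra quotient bits are obtained, and it uses exact integer division in place of the SRT-8 iterations.

```python
import struct

BIAS = 1023   # double-precision exponent bias
FRAC = 52     # number of stored fraction bits

def to_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def fdiv_sketch(a: float, b: float) -> float:
    """Divide two normal doubles along steps S1-S6 (round to nearest even)."""
    ba, bb = to_bits(a), to_bits(b)
    sa, ea, ma = ba >> 63, (ba >> FRAC) & 0x7FF, ba & ((1 << FRAC) - 1)
    sb, eb, mb = bb >> 63, (bb >> FRAC) & 0x7FF, bb & ((1 << FRAC) - 1)

    # S1: zero, subnormal, infinity and NaN operands are out of scope here.
    if ea in (0, 0x7FF) or eb in (0, 0x7FF):
        raise ValueError("sketch handles normal operands only")

    sign = sa ^ sb                 # S2: XOR of the sign bits
    exp = ea - eb + BIAS           # S3: exponent difference, re-biased

    ma |= 1 << FRAC                # restore the hidden leading 1
    mb |= 1 << FRAC
    if ma < mb:                    # S5 (part): keep the quotient in [1, 2)
        ma <<= 1
        exp -= 1

    # S4: extend the dividend (assumption) and divide; q has 56 bits,
    # i.e. 53 result bits plus 3 extra bits used for rounding.
    q, r = divmod(ma << 55, mb)

    # S5: round to nearest even using the 3 extra bits and the remainder.
    mant, low = q >> 3, q & 7
    if low > 4 or (low == 4 and (r != 0 or mant & 1)):
        mant += 1
    if mant >> (FRAC + 1):         # rounding carried into a new bit
        mant >>= 1
        exp += 1

    # S6: exponent range check.
    if exp >= 0x7FF:
        raise OverflowError("result exponent above the maximum")
    if exp <= 0:
        raise ArithmeticError("underflow: result exponent below the minimum")
    return from_bits((sign << 63) | (exp << FRAC) | (mant & ((1 << FRAC) - 1)))
```

For normal operands whose quotient is also normal, the sketch is intended to agree bit for bit with the host's own double-precision division, for example `fdiv_sketch(1.0, 3.0) == 1.0 / 3.0`.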

The floating-point division unit FDIV implements SIMD floating-point division instructions based on the SRT-8 algorithm. The SRT-8 instruction is issued as three calls, 01, 10 and 11; that is, the selection signal shown in the figure selects which call is executed. The three calls perform, respectively, division iterations 1 to 6 for double precision (1 to 3 for SIMD dual single precision), iterations 7 to 12 (4 to 6 for SIMD dual single precision), and iterations 13 to 18 (7 to 9 for SIMD dual single precision). Finally, the quotient is normalized to the precision of the operation, either double-precision division or SIMD dual single-precision division, according to the remainder and quotient output by the SRT-8 instruction and the number of SRT-8 calls.

Based on the SRT-8 algorithm, this structure uses hardware resource reuse and iteration partitioning to implement both double-precision floating-point division and SIMD dual single-precision floating-point division on the same hardware.

The SPE structure of the present invention supports floating-point division. The scalar floating-point division unit (SFDIV) is the functional unit in the SMAC component that performs scalar floating-point division. It implements four instructions: the double-precision floating-point division iteration instruction based on the SRT-8 division algorithm, the double-precision floating-point division normalization instruction, the SIMD dual single-precision floating-point division iteration instruction based on the SRT-8 division algorithm, and the SIMD dual single-precision floating-point division normalization instruction. Instructions in the GPDSP use 40-bit encoding, and the SIMD floating-point division instruction set comprises two double-precision floating-point division instructions (FSRT8D and FNORMD) and two SIMD dual single-precision floating-point division instructions (FSRT8S32 and FNORMS32). The implemented instructions are described in Table 1:
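The iteration partitioning can be illustrated with a simplified radix-8 digit recurrence. This is a hedged sketch, not the SRT-8 datapath itself: a real SRT divider selects quotient digits from a redundant signed digit set using a truncated remainder estimate, whereas the sketch below picks each digit by exact comparison. The names `radix8_steps` and `mantissa_divide` are hypothetical. What the sketch does show is that each radix-8 iteration retires 3 quotient bits, so 18 iterations split over three calls yield 54 bits for double precision and 9 iterations yield 27 bits for single precision.

```python
def radix8_steps(rem: int, divisor: int, quotient: int, steps: int):
    """Run `steps` radix-8 iterations; each one retires a 3-bit quotient digit."""
    for _ in range(steps):
        rem <<= 3                     # multiply the partial remainder by 8
        digit = rem // divisor        # exact digit in 0..7 (no redundancy)
        rem -= digit * divisor
        quotient = (quotient << 3) | digit
    return rem, quotient

def mantissa_divide(ma: int, mb: int, calls: int, steps_per_call: int):
    """Split the iterations over several calls, mirroring the 01/10/11 calls.

    Requires ma < 2 * mb, which holds for two normalized mantissas.
    """
    quotient = ma // mb               # integer digit, 0 or 1
    rem = ma - quotient * mb
    for _ in range(calls):
        rem, quotient = radix8_steps(rem, mb, quotient, steps_per_call)
    return quotient, rem

# Double precision: 3 calls x 6 iterations x 3 bits = 54 fraction bits.
q64, r64 = mantissa_divide((1 << 52) + 12345, (1 << 52) + 99999, 3, 6)
# Single precision: 3 calls x 3 iterations x 3 bits = 27 fraction bits.
q32, r32 = mantissa_divide((1 << 23) + 7, (1 << 23) + 1000, 3, 3)
```

Carrying the partial remainder and partial quotient from one call to the next is what lets the same hardware be reused by each call, with a separate normalization step applied at the end.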

Table 1. Divider instruction types and functions

Instruction name	Beats	Encoding bits	Function	Scalar instruction
SFSRT8D	7	40	Double-precision floating-point division iteration	Yes
SFNORMD	2	40	Double-precision floating-point division normalization	Yes
SFSRT8S32	4	40	SIMD dual single-precision floating-point division iteration	Yes
SFNORMS32	2	40	SIMD dual single-precision floating-point division normalization	Yes

Fig. 9 shows the structure of the 64*64 multiplier shared within the SMAC component. Because a 64x64-bit multiplier occupies a large area in the design, the multiplier in the SMAC component uses logic reuse: the body of the shared fixed-point/floating-point module is four 32x32-bit multipliers. After the data are input, operand processing precedes the multiplication. Depending on the issued instruction, operand selection and bit extension are performed first, and the processed operands are then fed to the four 32x32 multipliers. The result is produced as a fixed-point result or a floating-point result and, depending on the instruction, is either written back to the register or passed to the next pipeline stage.
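The decomposition of a 64x64 product into four 32x32 partial products can be written out as a short Python sketch. It covers the unsigned case only; the function names are hypothetical, and the operand selection and bit extension that the hardware performs per instruction are not modeled.

```python
MASK32 = (1 << 32) - 1

def mul32(a: int, b: int) -> int:
    """Stand-in for one 32x32-bit unsigned multiplier (64-bit product)."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    return a * b

def mul64_from_four_mul32(x: int, y: int) -> int:
    """Unsigned 64x64 -> 128-bit product built from four 32x32 products."""
    x_lo, x_hi = x & MASK32, x >> 32
    y_lo, y_hi = y & MASK32, y >> 32
    ll = mul32(x_lo, y_lo)            # weight 2^0
    lh = mul32(x_lo, y_hi)            # weight 2^32
    hl = mul32(x_hi, y_lo)            # weight 2^32
    hh = mul32(x_hi, y_hi)            # weight 2^64
    return (hh << 64) + ((lh + hl) << 32) + ll
```

The same four multipliers can serve narrower operations by feeding them different operand halves, which is the kind of reuse the paragraph above describes.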

Fig. 10 shows the topology of the third component SIEU in a specific application example. SIEU comprises the bit processing unit BP and the fixed-point arithmetic logic unit IALU, which are mutually independent units sharing the same data path; in any one cycle only one of them can execute a valid instruction. After execution, the result passes through the final-stage selection logic and is output to the corresponding destination address. In addition, depending on whether saturation occurs during execution, a set signal is generated to flag the conditionally executed instruction. In other words, the two units cannot start execution or write back in the same cycle, but they can run in parallel through software pipeline scheduling. Their operands come mainly from the scalar register file and may also come from an immediate Imm, the scalar-vector shared register SVR or the vector unit local control register VULCR; both units have only two-operand instructions.

Fig. 11 shows the topology of the bit processing unit BP in a specific application example. The BP unit comprises three functional units: a 64-bit shifter unit (shifter), a bit processing unit (Bitp) and a pack/unpack unit (PK). The decoded signals from the decode stage and the source operands from the registers are received by all three functional units, which start computing immediately; the final output is selected, according to a selection signal, from the shift, pack/unpack and bit processing results. If the instruction is a pack/unpack instruction that requires a saturation check, the saturation flag is output to the status register together with the result.

Fig. 12 shows the topology of the fixed-point arithmetic logic unit IALU in a specific application example. The IALU contains 8 sub-modules. Because add/subtract operations have the largest delay, the IALU design does not share the adder with the comparison instructions and uses separate adders instead. In the selection logic the add/subtract result is placed in the final stage. To reduce delay further, saturating add/subtract is split out and implemented by combining the add/subtract instruction, the saturation instruction and the relevant control register.

The above are only preferred embodiments of the present invention. The scope of protection is not limited to the above embodiments; all technical solutions that fall under the idea of the invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the invention are also regarded as falling within its scope of protection.
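The pack-with-saturation behaviour can be sketched as follows. The text does not give the element widths or the packing layout, so everything concrete here is an assumption made for illustration: the function names are hypothetical, two signed values are packed into 16-bit fields, and a single boolean stands in for the saturation flag sent to the status register.

```python
def saturate_signed(value: int, bits: int):
    """Clamp `value` to a signed `bits`-wide range; report whether it clamped."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value

def pack_pair_saturating(a: int, b: int, bits: int = 16):
    """Pack two signed values into adjacent fields (assumed layout: b high, a low)."""
    lo_val, lo_sat = saturate_signed(a, bits)
    hi_val, hi_sat = saturate_signed(b, bits)
    mask = (1 << bits) - 1
    packed = ((hi_val & mask) << bits) | (lo_val & mask)
    # The flag is produced alongside the result, as with the status register output.
    return packed, lo_sat or hi_sat
```

Producing the flag in the same step as the packed value mirrors the statement that the saturation flag is output at the same time as the result.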

Claims

1. A scalar operation unit structure supporting floating-point division in a GPDSP, characterized in that it comprises a first component SMAC1, a second component SMAC2 and a third component SIEU serving as scalar operation components for supporting basic scalar operations; each said scalar operation component corresponds to one scalar instruction in a VLIW execute packet.

2. The scalar operation unit structure supporting floating-point division in a GPDSP according to claim 1, characterized in that it further comprises a scalar register file for reading and writing back data; when a scalar instruction issued by the dispatch unit is received, it is decoded to determine which scalar operation component it belongs to, and the corresponding source operand addresses and read requests are sent to the scalar register file at the same time; once the instruction-valid signal reaches the scalar operation component, the component takes the data read from the scalar register file, starts the operation, and finally writes the result back to the scalar register file.

3. The scalar operation unit structure supporting floating-point division in a GPDSP according to claim 1, characterized in that said first component SMAC1 and second component SMAC2 are homogeneous MAC operation components; said MAC operation component comprises a floating-point multiply-add unit FMAC, a fixed-point multiply-add unit IMAC, a floating-point arithmetic logic unit FALU and a floating-point division unit FDIV; these functional units are mutually independent units sharing the same data path, in any one cycle only one functional unit can execute a valid instruction, and after execution the result passes through the final-stage selection logic and is output to the corresponding destination address.

4. The scalar operation unit structure supporting floating-point division in a GPDSP according to claim 3, characterized in that the multipliers of the floating-point multiply-add unit FMAC and the fixed-point multiply-add unit IMAC adopt a logic-module reuse scheme, the main body of the shared fixed-point/floating-point module being four 32x32-bit multipliers; after data input, the multiply operation starts with operand processing: according to the dispatched instruction, operand selection and bit extension are performed first, and the prepared operands are then fed into the four 32x32 multipliers for multiplication; the result is either a fixed-point result or a floating-point result and, depending on the instruction, is either written back to a register or passed to the next pipeline stage.

5. The scalar operation unit structure supporting floating-point division in a GPDSP according to claim 3, characterized in that the floating-point multiply-add unit FMAC is used to process complex multi-cycle floating-point operations and adopts a dynamic pipeline structure; one instruction can be issued per cycle, and within the same clock cycle each pipeline stage can perform a different operation.

6. The scalar operation unit structure supporting floating-point division in a GPDSP according to claim 5, characterized in that the floating-point multiply-add unit FMAC adopts an FMAC structure in which the double-precision alignment shift and the single-precision alignment shift are separated, comprising: an operand preparation module R, a mantissa multiplication module X, a double-precision multiply-add path Y, a single-precision multiply-add path Z, and a normalization module S shared by the single-precision and double-precision paths; the operand preparation module R, according to the instruction, separates the sign, exponent and mantissa of single-precision and double-precision floating-point operands in accordance with the IEEE-754 standard and performs exception detection on the input operands; the mantissa multiplication module X is responsible for computing the single-precision multiplication result mantissas of all instructions; the double-precision multiply-add path Y performs, for double-precision operations, the exponent difference calculation, the 161-bit alignment shift of the double-precision operand C, and the final-stage CSA 4:2 partial-product compression of the double-precision result mantissa; the single-precision multiply-add path Z performs the exponent difference calculation, the mantissa swap, and the alignment after the mantissa swap for SIMD multiply-add, SIMD multiply-subtract, SIMD multiplication, dot product and complex multiplication operations; the shared normalization module S performs the result mantissa calculation after the alignment shift, the normalization, and the exponent correction.

7. The scalar operation unit structure supporting floating-point division in a GPDSP according to claim 3, characterized in that the fixed-point multiply-add unit IMAC is used to perform fixed-point multiply-accumulate; in implementing the fixed-point and floating-point multiply-add and multiply-subtract instructions, the two operands input to the multiplier are 64-bit floating-point data, the third operand is a 53-bit floating-point operand, and the result is a 64-bit floating-point operand; when a fixed-point multiply-add or multiply-subtract instruction is executed, the two multiplier operands are 32-bit signed/unsigned operands, the third operand is a 64-bit signed/unsigned operand, and the result is a 64-bit signed/unsigned destination operand.

8. The scalar operation unit structure supporting floating-point division in a GPDSP according to claim 3, characterized in that the floating-point arithmetic logic unit FALU comprises a floating-point ALU short-cycle instruction module, a floating-point ALU conversion instruction module and a floating-point ALU addition/subtraction instruction module; the short-cycle instruction module contains all single-cycle floating-point arithmetic logic instructions, including the single/double-precision greater-than, less-than and equality comparison instructions, the instructions that extract the exponent, mantissa and absolute value of a single/double-precision number, the instructions that compute the single/double-precision reciprocal and reciprocal square root, and the instruction that converts a single-precision floating-point number into a double-precision floating-point number.

9. The scalar operation unit structure supporting floating-point division in a GPDSP according to any one of claims 1 to 8, characterized in that the third component SIEU comprises a bit processing unit BP and a fixed-point arithmetic logic unit IALU, which are mutually independent units sharing the same data path, and in any given cycle only one functional unit can execute a valid instruction; after execution, the result passes through final-stage selection logic and is output to the corresponding destination address; in addition, depending on whether saturation occurs during instruction execution, a flag-set signal is generated to indicate the condition under which the instruction executed.

10. The scalar operation unit structure supporting floating-point division in a GPDSP according to claim 9, characterized in that the BP unit comprises three functional units: a 64-bit shifter unit Shifter, a bit processing unit Bitp and a pack/unpack unit PK; the decode signals from the decode stage and the source operands from the registers are received by all three functional units, which start operating immediately; the final output is selected, according to a select signal, from among the results of the shift, pack/unpack and bit processing units; if the instruction is a pack/unpack instruction that requires a saturation check, the saturation flag is output to the status register at the same time as the result.
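
As an illustration of the multiplier reuse described in claim 4, the following sketch shows how a 64x64-bit unsigned product can be assembled from four 32x32-bit partial products. It is a minimal arithmetic model under stated assumptions, not the claimed circuit: the names `mul32x32` and `mul64_from_four_32x32` are hypothetical, and the operand selection and bit extension steps are reduced here to a simple split of each operand into high and low 32-bit halves.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul32x32(a, b):
    """Model of one 32x32-bit unsigned multiplier (hypothetical helper)."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    return a * b  # the product fits in 64 bits


def mul64_from_four_32x32(x, y):
    """Assemble a 64x64-bit unsigned product from four 32x32 partial products."""
    assert 0 <= x <= MASK64 and 0 <= y <= MASK64
    # Split each operand into a low and a high 32-bit half.
    x_lo, x_hi = x & MASK32, x >> 32
    y_lo, y_hi = y & MASK32, y >> 32
    # The four partial products, one per 32x32 multiplier.
    ll = mul32x32(x_lo, y_lo)
    lh = mul32x32(x_lo, y_hi)
    hl = mul32x32(x_hi, y_lo)
    hh = mul32x32(x_hi, y_hi)
    # Weight each partial product by its position and sum.
    return ll + ((lh + hl) << 32) + (hh << 64)
```

A fixed-point 32x32 multiply would use only one of the four partial products in this model, which is one way to read the sharing of the multiplier body between IMAC and FMAC.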
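
The work of the operand preparation module R in claim 6 (separating sign, exponent and mantissa according to IEEE-754 and detecting exceptional inputs) can be sketched in software as follows. This is an assumed behavioral model only: the function name `split_fields` and the category labels are hypothetical, and the source does not specify which exception classes the hardware distinguishes.

```python
import struct


def split_fields(value, double=True):
    """Split a float into IEEE-754 sign, biased exponent, mantissa and class.

    Hypothetical model of operand preparation; `double` selects the
    64-bit format (11-bit exponent, 52-bit fraction) or the 32-bit
    format (8-bit exponent, 23-bit fraction).
    """
    if double:
        bits = struct.unpack(">Q", struct.pack(">d", value))[0]
        exp_bits, frac_bits = 11, 52
    else:
        bits = struct.unpack(">I", struct.pack(">f", value))[0]
        exp_bits, frac_bits = 8, 23

    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << frac_bits) - 1)

    # Classify special encodings (assumed set of categories).
    if exponent == (1 << exp_bits) - 1:
        kind = "inf" if mantissa == 0 else "nan"
    elif exponent == 0:
        kind = "zero" if mantissa == 0 else "subnormal"
    else:
        kind = "normal"
    return sign, exponent, mantissa, kind
```

For example, `split_fields(1.0)` yields a biased exponent of 1023 with a zero fraction field, and `split_fields(1.5, double=False)` yields a biased exponent of 127 with only the top fraction bit set.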
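
The saturation handling in claim 10, where a pack/unpack instruction delivers a saturation flag to the status register together with its result, can be illustrated with the sketch below. The instruction format is an assumption made for the example: the source does not state the operand widths of the PK unit, so a pack of two signed values into two 16-bit halves of a 32-bit word is used here, and `saturate_s16` and `pack_sat16` are hypothetical names.

```python
def saturate_s16(value):
    """Clamp a signed integer to the 16-bit range; return (clamped, saturated)."""
    if value > 32767:
        return 32767, True
    if value < -32768:
        return -32768, True
    return value, False


def pack_sat16(hi, lo):
    """Pack two signed values into one 32-bit word with saturation.

    Returns (packed_word, saturation_flag). The flag models the bit
    that would be sent to the status register alongside the result.
    """
    hi_s, hi_flag = saturate_s16(hi)
    lo_s, lo_flag = saturate_s16(lo)
    # Two's-complement encode each half into 16 bits and concatenate.
    packed = ((hi_s & 0xFFFF) << 16) | (lo_s & 0xFFFF)
    return packed, hi_flag or lo_flag
```

In this model the result and the flag are produced by the same call, mirroring the claim's statement that the saturation flag is output at the same time as the result.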