CN117251132B

CN117251132B - Fixed-floating point SIMD multiply-add instruction fusion processing device and method and processor

Info

Publication number: CN117251132B
Application number: CN202311211479.9A
Authority: CN
Inventors: 冯春阳; 马思杰
Original assignee: Hexin Technology Co ltd; Shanghai Hexin Digital Technology Co ltd
Current assignee: Hexin Technology Co ltd; Shanghai Hexin Digital Technology Co ltd
Priority date: 2023-09-19
Filing date: 2023-09-19
Publication date: 2024-05-14
Anticipated expiration: 2043-09-19
Also published as: CN117251132A

Abstract

The invention relates to the technical field of digital signal processing, and in particular to a fixed-floating point SIMD multiply-add instruction fusion processing device, method and processor. The device comprises a data splitting module, a data integration module and a plurality of parallel operation modules; each operation module comprises a floating point operation module and a fixed point operation module which are connected; the data splitting module splits an input fixed-floating point operand into a plurality of split data according to the operand type and bit width; the floating point operation module registers the floating point path result, output after the multiply-add operation on the fixed-point multiplexing floating point operand, into the output selection pipeline stage of the fixed point operation module connected to it; the fixed point operation module outputs a fixed point operation result according to the computed fixed point path result and the floating point path result, and the results are then integrated by the data integration module. By multiplexing the floating point resources, the invention reduces the computing resources of the fixed point operation module, thereby effectively reducing the circuit area, and also simplifies the control logic.

Description

Fixed-floating point SIMD multiply-add instruction fusion processing device and method and processor

Technical Field

The invention relates to the technical field of digital signal processing, in particular to a fixed-floating point SIMD multiply-add instruction fusion processing device, method and processor.

Background

Fixed-point multiplication and floating-point multiplication are basic operations in digital signal processing. Current processors typically execute fixed-point instructions and floating-point instructions on separate hardware, which requires two sets of hardware resources. For scalar instructions the duplicated resources are relatively small, but SIMD (Single Instruction, Multiple Data) instructions require a large number of repeated resources. For multiply-add instructions in particular, the adders and multipliers occupy a relatively large area, which enlarges the processor and increases its power consumption. A device capable of fusing fixed-point and floating-point instructions is therefore highly desirable.

Disclosure of Invention

The invention provides a fixed-floating point SIMD multiply-add instruction fusion processing device, a method and a processor, which solve the technical problem that existing processors compute fixed-point instructions and floating-point instructions separately, which requires a large amount of duplicated resources and results in a larger processor area and higher power consumption.

In order to solve the technical problems, the invention provides a fixed-floating point SIMD multiply-add instruction fusion processing device, a method and a processor.

In a first aspect, the present invention provides a fixed-floating point SIMD multiply-add instruction fusion processing apparatus, including a data splitting module, an operation device and a data integration module which are sequentially connected; the operation device comprises a plurality of parallel operation modules, and each operation module comprises a floating point operation module and a fixed point operation module which are connected in a one-to-one correspondence manner;

the data splitting module is used for splitting an input fixed floating point operand into a plurality of split data according to the operand type and bit width, and distributing the split data to different operation modules; the split data comprises floating point operation operands with different bit widths and fixed point operands, wherein the floating point operation operands comprise floating point operands or fixed point multiplexing floating point operands;

The floating point operation module is used for operating on the floating point operation operands stage by stage using a preset floating point multi-stage pipeline structure to obtain a floating point operation result, and for registering the floating point path result, output after the fixed-point multiplexing floating point operand is subjected to the multiply-add operation, into the output selection pipeline stage of the fixed point operation module connected to the floating point operation module;

The fixed point operation module is used for operating on the fixed point operands stage by stage using a preset fixed point multi-stage pipeline structure to obtain a fixed point path result, and outputting a fixed point operation result according to the floating point path result and the fixed point path result; the fixed point operation module and the floating point operation module have the same number of pipeline stages;

the data integration module is used for integrating the operation results output by the floating point operation module or the fixed point operation module according to the bit width of the instruction set to obtain data integration results.

In a further embodiment, the data splitting module is specifically configured to:

Performing type detection on an input fixed-floating point operand; if the type of the fixed-floating point operand belongs to a preset resource multiplexing type, splitting the fixed-floating point operand into a plurality of fixed-point multiplexing floating point operands and/or fixed-point operands according to the corresponding bit width and the instruction set bit width, respectively distributing them to the floating point operation modules and/or fixed point operation modules in different operation modules, and, by multiplexing the structure of the floating point operation module, converting the fixed-point multiplexing floating point operands to the bit width of the floating point mantissa for calculation;

if the type of the fixed floating point operand belongs to a floating point type, splitting the fixed floating point operand into a plurality of floating point operands according to the corresponding bit width and instruction set bit width, and respectively distributing the plurality of floating point operands to floating point operation modules in different operation modules;

If the type of the fixed floating point operand belongs to the fixed point type, splitting the fixed floating point operand into a plurality of fixed point operands according to the corresponding bit width and instruction set bit width, and respectively distributing the fixed point operands to fixed point operation modules in different operation modules;

The resource multiplexing type is a type for converting part or all of fixed point numbers in fixed floating point operands into floating point numbers for calculation.

In a further embodiment, the resource multiplexing types at least include a fixed-point word type and a fixed-point halfword type, wherein fixed-point word type instructions fully multiplex the floating point operation module and fixed-point halfword type instructions partially multiplex the floating point operation module, so that all of the data of fixed-point word type instructions and part of the data of fixed-point halfword type instructions are converted into floating point numbers for multiply-add calculation.

In a further embodiment, the multi-stage pipeline structure of the fixed point operation module at least comprises a first fixed point multiplication unit, a second fixed point multiplication unit, a fixed point addition unit, an output selection unit and a result output unit which are arranged in different stages of pipeline structures; the fixed-point adding unit comprises a first adder and a second adder which are respectively connected with the second multiplier and the third multiplier;

The first fixed point multiplication unit is used for carrying out multiplication operation on the input fixed point operand to obtain a first multiplication operation result;

the second fixed-point multiplication unit is used for carrying out multiplication operation on the first multiplication operation result to obtain a second multiplication operation result;

The fixed point addition unit is used for carrying out addition operation on the second multiplication operation result to obtain a fixed point path result;

The output selection unit is used for responding to the input of a fixed-floating point selection signal, selecting between the fixed-point path result and the floating point path result to obtain a selection result, and outputting a fixed-point operation result according to the selection result through the result output unit.

In a further embodiment, the floating point operation module includes a data decoding unit, a floating point multiplication unit, a floating point addition unit, a data normalization unit, and a rounding and status bit generation unit;

the data decoding unit is used for splitting the input floating point operand by sign bits, exponents and mantissas to obtain floating point split data;

The floating point multiplication unit is used for carrying out addition operation on exponent bits of floating point split data, carrying out exclusive OR operation on sign bits of the floating point split data, and carrying out multiplication operation on mantissas in the floating point split data or the input fixed point multiplexing floating point operand to obtain a corresponding floating point multiplication operation result or multiplexing multiplication operation result;

The floating point addition unit is used for carrying out addition operation on the floating point multiplication operation result or the multiplexing multiplication operation result to obtain a corresponding floating point addition operation result or floating point path result, registering the floating point path result, output after the fixed-point multiplexing floating point operand is subjected to the multiply-add operation, into the output selection unit of the fixed point operation module connected to it, and registering the floating point addition operation result into the data normalization unit;

the data normalizing unit is used for normalizing floating-point multiplication operation results;

the rounding and status bit generating unit is used for rounding the normalized mantissa to obtain a rounded result, generating exception status bits, and outputting the floating point operation result in a floating point number format consisting of a sign bit, exponent bits and mantissa bits.

In a further embodiment, the data integration result includes a fixed point data integration result or a floating point data integration result, and the data integration module is specifically configured to:

And responding to the input of the fixed-point instruction, selecting and integrating fixed-point operation results output by each fixed-point operation module according to the bit width of the instruction set, and generating a fixed-point data integration result.

and responding to the input of the floating point instruction, selecting and integrating the floating point operation results output by each floating point operation module according to the instruction set bit width, and generating a floating point data integration result.

In a second aspect, the present invention provides a fixed floating point SIMD multiply-add instruction fusion processing method, applied to a plurality of parallel operation modules, the method comprising the steps of:

According to the operand type and bit width, splitting an input fixed floating point operand into a plurality of split data, and distributing the split data to different operation modules; the split data comprises floating point operation operands with different bit widths and fixed point operands, wherein the floating point operation operands comprise floating point operands or fixed point multiplexing floating point operands;

Operating on the floating point operation operands stage by stage using a preset floating point multi-stage pipeline structure to obtain a floating point operation result, and registering the floating point path result, output after the multiply-add operation on the fixed-point multiplexing floating point operand, into the fixed-point output selection pipeline stage;

Operating on the fixed-point operands stage by stage using a preset fixed-point multi-stage pipeline structure to obtain a fixed-point path result, and outputting a fixed-point operation result according to the floating-point path result and the fixed-point path result;

And integrating the floating point operation result or the fixed point operation result according to the bit width of the instruction set to obtain a data integration result.

In a third aspect, the present invention also provides a processor, which includes a processor body and a fixed floating point SIMD multiply-add instruction fusion processing device provided in the processor body.

In a fourth aspect, the present invention also provides a computer device, including a processor and a memory, where the processor is connected to the memory, and the processor includes a processor body and a fixed-floating point SIMD multiply-add instruction fusion processing apparatus as described above provided in the processor body.

The invention provides a fixed-floating point SIMD multiply-add instruction fusion processing device, a method and a processor. The device comprises a data splitting module, a floating point operation module, a fixed point operation module and a data integration module. The data splitting module splits an input fixed-floating point operand into a plurality of split data and distributes the split data to different operation modules; the floating point operation module operates on the floating point operation operands and registers the floating point path result, output after the multiply-add operation on the fixed-point multiplexing floating point operand, into the output selection pipeline stage of the fixed point operation module connected to it; the fixed point operation module operates on the fixed point operands and outputs a fixed point operation result according to the floating point path result and the fixed point path result; the data integration module then integrates the operation results output by the floating point operation module or the fixed point operation module. Compared with the prior art, the device executes fixed-point instructions and floating-point instructions on shared multipliers and adders, which reduces the number of adders and multipliers and thereby effectively reduces the circuit area.

Drawings

FIG. 1 is a block diagram of a fixed-floating point SIMD multiply-add instruction fusion processing apparatus provided by an embodiment of the invention;

FIG. 2 is a diagram of scalar instructions versus vector instructions according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating different types of instruction data formats provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a fixed-point operation module according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a floating point module according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a specific structure of a fixed-floating point SIMD multiply-add instruction fusion processing apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of operand splitting logic provided by an embodiment of the present invention;

FIG. 8 is a diagram illustrating an exemplary resource occupation of a fixed point arithmetic module multiplexing floating point computing resources according to an embodiment of the present invention;

FIG. 9 is a diagram illustrating an example of resource occupation in which a fixed point operation module does not reuse floating point computing resources according to an embodiment of the present invention;

FIG. 10 is a flow chart of a fixed-floating point SIMD multiply-add instruction fusion processing method provided by an embodiment of the invention;

FIG. 11 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The following embodiments are given for the purpose of illustration only and are not to be construed as limiting the invention. The accompanying drawings are provided for reference and description only and do not limit the scope of the invention, since many variations are possible without departing from the spirit and scope of the invention.

Referring to fig. 1, in order to facilitate understanding of the fixed-floating point SIMD multiply-add instruction fusion processing apparatus provided in the present embodiment, the present embodiment first briefly describes SIMD instructions and instruction sets:

SIMD (Single Instruction, Multiple Data) instructions, also called vector instructions, are a common technique for spatial parallelism in which one instruction processes multiple data elements simultaneously. As shown in fig. 2, when the inputs are X and Y, a scalar instruction can only compute X×Y, whereas a SIMD instruction can compute X0×Y0, X1×Y1, X2×Y2 and X3×Y3 at the same time.
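The scalar/vector contrast can be illustrated in software. The following is only a behavioural sketch, not the hardware described here; the function names `scalar_multiply` and `simd_multiply` are hypothetical, and the four-lane width is taken from the example of fig. 2.

```python
def scalar_multiply(x, y):
    """One instruction, one result: X * Y."""
    return x * y


def simd_multiply(xs, ys):
    """One instruction, several results: lane i yields xs[i] * ys[i]."""
    if len(xs) != len(ys):
        raise ValueError("both vector operands need the same number of lanes")
    return [x * y for x, y in zip(xs, ys)]


print(scalar_multiply(3, 4))                      # 12
print(simd_multiply([1, 2, 3, 4], [5, 6, 7, 8]))  # [5, 12, 21, 32]
```

In hardware the lanes are computed by parallel operation modules in the same cycle; the list comprehension only models the lane-wise result.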

Table 1 is an instruction set table. As shown in table 1, the operations that the fixed-floating point SIMD multiply-add instruction fusion processing apparatus provided in this embodiment can process include fixed-floating point addition, subtraction, multiplication and multiply-add instructions. These instructions include scalar instructions and vector instructions; the number of operations of a scalar instruction defaults to 1 and is not analyzed here. Table 1 is as follows:

TABLE 1

In table 1, bit denotes a single bit, occupying a data bit width of 1 bit; dp denotes a double-precision floating point number (64 bits); sp denotes a single-precision floating point number (32 bits); b denotes a byte (8 bits); h denotes a halfword (16 bits); w denotes a word (32 bits). The data formats of the different instruction types are shown in fig. 3. Byte-type fixed-point SIMD instructions allow parallel operation on multiple byte-sized (8-bit) data in one clock cycle; halfword-type fixed-point SIMD instructions allow parallel operation on multiple halfword-sized (16-bit) data in one clock cycle; word-type fixed-point SIMD instructions allow parallel operation on multiple word-sized (32-bit) data in one clock cycle; single-precision floating-point SIMD instructions allow parallel operation on multiple single-precision floating point numbers (32 bits) in one clock cycle; double-precision floating-point SIMD instructions allow parallel operation on multiple double-precision floating point numbers (64 bits) in one clock cycle. These instructions can be used on a processor supporting a SIMD instruction set and exploit parallelism to accelerate the calculation and improve computing performance and efficiency.

As can be seen from the instruction set shown in table 1, the word type of the fixed-point instructions has the same space occupation and number of operations as the single-precision floating point number, so word-type fixed-point instructions can fully multiplex the floating point resources. The number of halfword-type operations is twice that of single-precision floating point numbers and the data bit width is half, so halfword-type fixed-point instructions can partially multiplex the floating point resources. The number of byte-type operations is four times that of single-precision floating point numbers and the data bit width is one quarter.

Based on the above analysis, this embodiment provides a fixed-floating point SIMD multiply-add instruction fusion processing device as shown in FIG. 1, which comprises a data splitting module 101, an operation device 102 and a data integration module 103 which are sequentially connected; the operation device 102 includes a plurality of parallel operation modules 104, and each operation module includes a fixed point operation module 1041 and a floating point operation module 1042 which are connected in a one-to-one correspondence.
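The ratios between the element types can be checked with a few lines of arithmetic. This is a minimal sketch assuming the 128-bit instruction set bit width that this embodiment uses as its example; the names `ELEMENT_BITS` and `lane_count` are hypothetical.

```python
# Element widths in bits, as listed for table 1.
ELEMENT_BITS = {"b": 8, "h": 16, "w": 32, "sp": 32, "dp": 64}


def lane_count(elem_type, vector_bits=128):
    """Number of elements of the given type that fit in one vector operand."""
    width = ELEMENT_BITS[elem_type]
    if vector_bits % width != 0:
        raise ValueError("vector width must be a multiple of the element width")
    return vector_bits // width


for name in ("dp", "sp", "w", "h", "b"):
    print(name, lane_count(name))
```

With a 128-bit operand, word and single-precision both give 4 lanes, halfword gives 8 (twice as many at half the width) and byte gives 16 (four times as many at a quarter of the width), which matches the relationships stated above.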

The fixed-floating point SIMD multiply-add instruction fusion processing device of this embodiment lets fixed-point instructions multiplex the floating point operation hardware: the floating point operation module processes the fixed-point multiplexing floating point operands, which are converted into floating point numbers for calculation. This simplifies the hardware design, reduces the number and complexity of components, and makes fuller use of the floating point operation module, so that more computing capability is available, the calculation speed can be increased, and the overall performance and efficiency of the system are improved.

The data splitting module 101 is configured to split an input fixed floating point operand into a plurality of split data according to an operand type and a bit width, and distribute the plurality of split data to different operation modules; the split data comprises floating point operation operands with different bit widths and fixed point operands, wherein the floating point operation operands comprise floating point operands or fixed point multiplexing floating point operands; it should be noted that, the fixed-point multiplexing floating-point operand refers to converting the fixed-point operand into a floating-point number for calculation in the calculation process, and converting the result back into the fixed-point number after the calculation is completed.

In this embodiment, the data splitting module is specifically configured to perform type detection on an input fixed-floating point operand. If the type of the fixed-floating point operand belongs to a preset resource multiplexing type, the operand is split into a plurality of fixed-point multiplexing floating point operands and/or fixed-point operands according to the corresponding bit width and the instruction set bit width, these are respectively distributed to the floating point operation modules in different operation modules, and, by multiplexing the structure of the floating point operation module, the fixed-point multiplexing floating point operands are converted to the bit width of the floating point mantissa for calculation. If the type of the fixed-floating point operand belongs to a floating point type, the operand is split into a plurality of floating point operands according to the corresponding bit width and the instruction set bit width, and the floating point operands are respectively distributed to the floating point operation modules in different operation modules. If the type of the fixed-floating point operand belongs to the fixed point type, the operand is split into a plurality of fixed point operands according to the corresponding bit width and the instruction set bit width, and the fixed point operands are respectively distributed to the fixed point operation modules in different operation modules. The resource multiplexing type is a type in which part or all of the fixed point numbers in a fixed-floating point operand are converted into floating point numbers for calculation.
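The type detection and splitting can be sketched as a lookup followed by a bit-field extraction. This is an illustrative model only: the names `ROUTES`, `split_lanes` and `dispatch` are hypothetical, the 128-bit width is the example width of this embodiment, and routing byte-type data to the fixed point path alone is an assumption, since the text names only the word and halfword types as resource multiplexing types.

```python
ELEMENT_BITS = {"b": 8, "h": 16, "w": 32, "sp": 32, "dp": 64}

# Hypothetical routing table: which path(s) receive the split data.
ROUTES = {
    "dp": ("float",),
    "sp": ("float",),
    "w": ("float",),          # word: fully multiplexes the floating point module
    "h": ("float", "fixed"),  # halfword: partially multiplexes it
    "b": ("fixed",),          # assumption: byte stays on the fixed point path
}


def split_lanes(operand, elem_type, vector_bits=128):
    """Split an integer-encoded vector operand into lanes, lowest bits first."""
    width = ELEMENT_BITS[elem_type]
    mask = (1 << width) - 1
    return [(operand >> shift) & mask for shift in range(0, vector_bits, width)]


def dispatch(operand, elem_type, vector_bits=128):
    """Return the target path(s) and the split data for one operand."""
    return ROUTES[elem_type], split_lanes(operand, elem_type, vector_bits)


paths, lanes = dispatch(0x00000004_00000003_00000002_00000001, "w")
print(paths, lanes)  # ('float',) [1, 2, 3, 4]
```

How the halfword lanes are divided between the two paths is not modelled here; the sketch only records that both paths are involved.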

Floating point operations and fixed point operations usually require different hardware resources. In this embodiment, hardware resources are shared by converting fixed point operands into floating point numbers for calculation, which reduces the hardware resources required in the fixed point operation module and thereby effectively reduces the area occupied by the processor. In addition, converting fixed point numbers into floating point numbers for calculation can reduce the number of fixed point operations to a certain extent, reducing the consumption of hardware resources and time.

Specifically, the resource multiplexing types preset in this embodiment at least include a fixed-point word type and a fixed-point halfword type, where fixed-point word type instructions fully multiplex the floating point operation module and fixed-point halfword type instructions partially multiplex the floating point operation module, so that all of the data of fixed-point word type instructions and part of the data of fixed-point halfword type instructions are converted into floating point numbers for multiply-add calculation.

In this embodiment, fixed-point word type instructions fully multiplex the floating point operation module, and part of the data of fixed-point halfword type instructions multiplexes the floating point operation module. By letting fixed-point word and halfword type instructions multiplex the floating point operation hardware, hardware resources are shared, and the multipliers and adders that would otherwise be required for the word-type operations and for part of the halfword-type operations are omitted, which reduces the number and complexity of components and improves the utilization of hardware resources.

In this embodiment, the fixed point operation module 1041 is configured to operate on the fixed point operands stage by stage using a preset fixed point multi-stage pipeline structure to obtain a fixed point path result, and to output a fixed point operation result according to the floating point path result and the fixed point path result; the fixed point operation module and the floating point operation module have the same number of pipeline stages. As shown in fig. 4, the multi-stage pipeline structure of the fixed point operation module 1041 at least comprises a first fixed point multiplication unit 411, a second fixed point multiplication unit 412, a fixed point addition unit 413, an output selection unit 414 and a result output unit 415, which are arranged in different pipeline stages; the fixed-point adding unit comprises a first adder and a second adder which are respectively connected with the second multiplier and the third multiplier.

The first fixed-point multiplication unit 411 is configured to multiply the input fixed-point operand to obtain a first multiplication result;

The second fixed-point multiplication unit 412 is configured to multiply the first multiplication result to obtain a second multiplication result;

The fixed point adding unit 413 is configured to perform an addition operation on the second multiplication result, to obtain a fixed point path result;

The output selecting unit 414 is configured to select between the fixed point path result and the floating point path result in response to an input of a fixed floating point selection signal, so as to obtain a selection result, and output the selection result through the result output unit 415.
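The output selection stage behaves like a two-input multiplexer placed after the fixed point multiply-add path. The sketch below is a behavioural model under stated assumptions: the names `fixed_multiply_add` and `output_select` are hypothetical, the internal split into two multiplication units is collapsed into one expression, and wrapping the result to the element width is an assumption, since overflow handling is not described in this part of the text.

```python
def fixed_multiply_add(a, b, c, bits):
    """Fixed point path: a * b + c, wrapped to `bits` bits (assumed behaviour)."""
    mask = (1 << bits) - 1
    return (a * b + c) & mask


def output_select(fixed_path_result, float_path_result, select_float_path):
    """Pick one of the two path results according to the selection signal."""
    return float_path_result if select_float_path else fixed_path_result


fixed_result = fixed_multiply_add(3, 4, 5, bits=8)  # byte-sized data, fixed point path
print(output_select(fixed_result, 0, select_float_path=False))  # 17
```

Because both modules have the same number of pipeline stages, the two candidate results reach the selection stage in the same cycle, so a plain selection like this suffices and no extra buffering is needed.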

In this embodiment, the floating point operation module is configured to operate on the floating point operation operands stage by stage using a preset floating point multi-stage pipeline structure to obtain a floating point operation result, and to register the floating point path result, output after the multiply-add operation on the fixed-point multiplexing floating point operand, into the output selection pipeline stage of the fixed point operation module connected to it. As shown in fig. 5, the floating point operation module 1042 includes a data decoding unit 421, a floating point multiplication unit 422, a floating point addition unit 423, a data normalization unit 424, and a rounding and status bit generation unit 425.

The data decoding unit 421 is configured to split the input floating-point operand by sign bits, exponents and mantissas to obtain floating-point split data;

the floating-point multiplication unit 422 is configured to add exponent bits of the floating-point split data, exclusive-or sign bits of the floating-point split data, and multiply mantissas in the floating-point split data or the input fixed-point multiplexed floating-point operands to obtain corresponding floating-point multiplication results or multiplexed multiplication results;

The floating point addition unit 423 is configured to perform an addition operation on the floating point multiplication result or the multiplexing multiplication result to obtain a corresponding floating point addition result or floating point path result, to register the floating point path result, output after the fixed-point multiplexing floating point operand is subjected to the multiply-add operation, into the output selection unit of the fixed point operation module connected to it, and to register the floating point addition result into the data normalization unit;

the data normalizing unit 424 is configured to normalize the floating-point multiplication result;

the rounding and status bit generation unit 425 is configured to perform a rounding operation on the normalized mantissa to obtain a rounded result, generate an abnormal status bit, and output the floating point operation result in a floating point number format of sign bits, exponent bits, and mantissa bits.
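The decoding and multiplication steps can be illustrated for single-precision numbers. This is a minimal sketch, not the circuit of fig. 5: it assumes the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits with bias 127, 23 fraction bits), handles normal numbers only, and leaves out normalization, rounding and status bits. The function names are hypothetical.

```python
import struct


def decode_sp(x):
    """Split a single-precision value into (sign, biased exponent, fraction)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def multiply_fields(a, b):
    """XOR the signs, add the exponents, multiply the 24-bit mantissas."""
    sa, ea, fa = decode_sp(a)
    sb, eb, fb = decode_sp(b)
    mantissa_a = (1 << 23) | fa       # restore the implicit leading 1
    mantissa_b = (1 << 23) | fb
    sign = sa ^ sb
    exponent = ea + eb - 127          # keep a single bias in the sum
    product = mantissa_a * mantissa_b # up to 48 bits, scaled by 2**46
    return sign, exponent, product


def fields_to_float(sign, exponent, product):
    """Reassemble an unrounded value, for checking the sketch only."""
    value = product / float(1 << 46) * 2.0 ** (exponent - 127)
    return -value if sign else value


print(decode_sp(-2.5))                               # (1, 128, 2097152)
print(fields_to_float(*multiply_fields(1.5, -2.5)))  # -3.75
```

The integer mantissa multiplier in `multiply_fields` is the kind of resource that a fixed-point multiplexing floating point operand can share, which is why such operands are brought to the mantissa bit width before calculation.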

Fixed-point operations and floating-point operations need different hardware resources, and designing the fixed-point operation module and the floating-point operation module separately requires more hardware components and more complex control logic. In this embodiment, after the floating-point operation module performs the multiply-add operation on the fixed-point multiplexing floating-point operand, the result is passed back to the fixed-point operation module. Hardware resources are thereby shared, the hardware resources needed in the fixed-point operation module for calculating the fixed-point multiplexing floating-point operands are reduced, the control logic is simplified, and the calculation is accelerated, which improves the calculation efficiency and reduces the resource consumption.

In this embodiment, the data integration module 103 is configured to integrate the operation result output by the floating point operation module or the fixed point operation module according to the instruction set bit width to obtain a data integration result, which specifically includes:

responding to the input of a fixed-point instruction, selecting and integrating fixed-point operation results output by each fixed-point operation module according to the bit width of an instruction set, and generating fixed-point data integration results;

Or responding to the input of the floating point instruction, selecting and integrating the floating point operation results output by each floating point operation module according to the instruction set bit width, and generating a floating point data integration result.

Because the fixed-floating point data operation generally has a plurality of operation results, the data integration module in the embodiment selects and integrates the plurality of operation results of the fixed-floating point data, so that the results can be quickly selected and integrated, the consistency and accuracy of final output data are ensured, and meanwhile, each operation result does not need to be processed respectively, so that unnecessary calculation and repeated operation can be avoided, the calculation efficiency is improved, and the calculation time and the resource consumption are reduced.

To facilitate description of the fixed-floating point SIMD multiply-add instruction fusion processing device provided in this embodiment, the description below takes as an example an instruction set bit width of 128 bits and the five execution cycles (five-stage pipeline) shown in Table 1. As shown in fig. 6, the data splitting module generates the valid signal of the current instruction operation and splits the 128-bit operand among the operation modules, with 64-bit floating-point numbers given to the floating-point operation modules. This embodiment preferably provides four groups of operation modules, each group comprising a floating-point operation module and a fixed-point operation module connected in one-to-one correspondence, specifically: the first group comprises the connected first floating-point operation module 0 and first fixed-point operation module 0; the second group comprises the connected second floating-point operation module 1 and second fixed-point operation module 1; the third group comprises the connected third floating-point operation module 2 and third fixed-point operation module 2; and the fourth group comprises the connected fourth floating-point operation module 3 and fourth fixed-point operation module 3.

This embodiment preferably divides the 128-bit operand into four parts, which are supplied respectively to the four floating-point operation modules, each of which processes a data length of 64 bits: bits 0 to 63 of the operand are given to the first floating-point operation module 0; bits 32 to 63 of the operand, with the low 32 bits padded with 0, are given to the second floating-point operation module 1; bits 64 to 127 of the operand are given to the third floating-point operation module 2; and bits 96 to 127 of the operand, with the low 32 bits padded with 0, are given to the fourth floating-point operation module 3. Fig. 7 is a schematic diagram of the operand splitting logic provided in this embodiment; the data splitting module generates the valid signals for fixed-point numbers and floating-point numbers.
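The floating-point operand distribution just described can be illustrated with a minimal Python sketch. It is not the patented circuit: the helper names `bits` and `split_float_operand` are hypothetical, and the sketch assumes that bit 0 denotes the most significant bit of the 128-bit operand, a numbering convention the text does not state.

```python
def bits(value, first, last, width=128):
    """Extract bits first..last inclusive; bit 0 is the MSB (assumption)."""
    count = last - first + 1
    return (value >> (width - 1 - last)) & ((1 << count) - 1)


def split_float_operand(op128):
    """Return the 64-bit inputs of floating-point operation modules 0..3."""
    return [
        bits(op128, 0, 63),          # module 0: bits 0..63
        bits(op128, 32, 63) << 32,   # module 1: bits 32..63, low 32 bits zero
        bits(op128, 64, 127),        # module 2: bits 64..127
        bits(op128, 96, 127) << 32,  # module 3: bits 96..127, low 32 bits zero
    ]


lanes = split_float_operand(0x00112233445566778899AABBCCDDEEFF)
print([hex(x) for x in lanes])
```

Under these assumptions, modules 0 and 2 each receive a full 64-bit half, while modules 1 and 3 receive a 32-bit field placed in the high half of a zero-padded 64-bit word.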

In this embodiment, a 53-bit operand is output for a fixed-point instruction that needs to multiplex floating-point resources. If the fixed-point number is of word type, bits 0 to 31 of the operand, with the upper 21 bits filled with 0 or the sign bit (the sign bit for signed instructions, 0 for unsigned instructions), are given to the first floating-point operation module 0; bits 32 to 63 of the operand, with the upper 21 bits filled with 0 or the sign bit, are given to the second floating-point operation module 1; bits 64 to 95 of the operand, with the upper 21 bits filled with 0 or the sign bit, are given to the third floating-point operation module 2; and bits 96 to 127 of the operand, with the upper 21 bits filled with 0 or the sign bit, are given to the fourth floating-point operation module 3.

If the fixed-point number is of halfword type, bits 0 to 15 of the operand, with the upper 37 bits filled with 0 or the sign bit (as for the word type), are given to the first floating-point operation module 0; bits 32 to 47 of the operand, with the upper 37 bits filled with 0 or the sign bit, are given to the second floating-point operation module 1; bits 64 to 79 of the operand, with the upper 37 bits filled with 0 or the sign bit, are given to the third floating-point operation module 2; and bits 96 to 111 of the operand, with the upper 37 bits filled with 0 or the sign bit, are given to the fourth floating-point operation module 3.

For operands input to the fixed-point operation modules, halfword-type and byte-type fixed-point instructions are output in a 32-bit format. For the halfword type, bits 16 to 31 of the operand, with the upper 16 bits filled with 0 or the sign bit, are given to the first fixed-point operation module 0; bits 48 to 63 of the operand, with the upper 16 bits filled with 0 or the sign bit, are given to the second fixed-point operation module 1; bits 80 to 95 of the operand, with the upper 16 bits filled with 0 or the sign bit, are given to the third fixed-point operation module 2; and bits 112 to 127 of the operand, with the upper 16 bits filled with 0 or the sign bit, are given to the fourth fixed-point operation module 3.

If the fixed-point number is of byte type, bits 0 to 31 of the operand are given to the first fixed-point operation module 0; bits 32 to 63 of the operand are given to the second fixed-point operation module 1; bits 64 to 95 of the operand are given to the third fixed-point operation module 2; and bits 96 to 127 of the operand are given to the fourth fixed-point operation module 3.
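The routing of word, halfword and byte operands described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, bit 0 is taken to be the most significant bit of the 128-bit operand (the text does not state its numbering convention), and `None` marks a path that the given type does not use in a group.

```python
def bits(value, first, last, width=128):
    """Extract bits first..last inclusive; bit 0 is the MSB (assumption)."""
    count = last - first + 1
    return (value >> (width - 1 - last)) & ((1 << count) - 1)


def extend(field, from_width, to_width, signed):
    """Fill the upper bits with the sign bit (signed) or with 0 (unsigned)."""
    if signed and (field >> (from_width - 1)) & 1:
        field |= ((1 << (to_width - from_width)) - 1) << from_width
    return field


def route_fixed(op128, kind, signed):
    """Return (floating_inputs, fixed_inputs) for groups 0..3."""
    floating, fixed = [], []
    for group in range(4):
        base = 32 * group
        if kind == "word":        # 32 data bits + 21 fill bits = 53 bits
            floating.append(extend(bits(op128, base, base + 31), 32, 53, signed))
            fixed.append(None)
        elif kind == "halfword":  # first halfword to FP path, second to fixed path
            floating.append(extend(bits(op128, base, base + 15), 16, 53, signed))
            fixed.append(extend(bits(op128, base + 16, base + 31), 16, 32, signed))
        elif kind == "byte":      # all four bytes stay in the fixed-point module
            floating.append(None)
            fixed.append(bits(op128, base, base + 31))
        else:
            raise ValueError(f"unknown fixed-point type: {kind}")
    return floating, fixed
```

For a signed halfword instruction whose first 32-bit group holds `0x8001` and `0x7FFF`, the floating-point path of group 0 receives `0x8001` sign-extended to 53 bits and the fixed-point path receives `0x7FFF` extended to 32 bits.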

Specifically, in the first-stage pipeline structure each floating-point operation module splits the floating-point number output by the data splitting module according to the format of sign bit, exponent bits and mantissa bits, and expands the hidden bit of the mantissa. The maximum width of a mantissa including the hidden bit is 53 bits in double-precision format (53 bits for a double-precision mantissa, 24 bits for a single-precision mantissa), so the calculation bit width adopted in each floating-point operation module is 53 bits; a single-precision mantissa occupies at most 24 of these bits. Fixed-point numbers routed through the floating-point operation module have already been extended to 53 bits in the data splitting module, so they require no additional processing in the first-stage pipeline structure. The multiplier and the adder adopted in this embodiment are both 53 bits wide: the second-stage pipeline structure uses the 53-bit multiplier, the third-stage pipeline structure uses the 53-bit adder, the fourth-stage pipeline structure normalizes the addition result, and the fifth-stage pipeline structure rounds the normalized mantissa, generates the status bits, and combines and outputs the result in the floating-point number format of sign bit, exponent bits and mantissa bits.
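The first-stage decoding can be illustrated in software. The sketch below assumes the standard IEEE 754 binary64 and binary32 layouts and uses hypothetical function names; it only shows why the datapath needs 53 bits for a double-precision mantissa and 24 bits for a single-precision one, and it does not treat infinities or NaNs specially.

```python
import struct


def decode_double(x):
    """Split a double into sign, biased exponent and 53-bit mantissa."""
    raw = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = raw >> 63
    exponent = (raw >> 52) & 0x7FF
    fraction = raw & ((1 << 52) - 1)
    hidden = 1 if exponent != 0 else 0  # zero and subnormals have no hidden 1
    return sign, exponent, (hidden << 52) | fraction


def decode_single(x):
    """Split a single-precision value into sign, exponent and 24-bit mantissa."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF
    fraction = raw & ((1 << 23) - 1)
    hidden = 1 if exponent != 0 else 0
    return sign, exponent, (hidden << 23) | fraction


print(decode_double(-1.5))
print(decode_single(1.0))
```

Because a word-type or halfword-type fixed-point operand extended to 53 bits has the same width as the expanded double-precision mantissa, it can enter the same multiplier and adder without a separate decoding step.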

In this embodiment, the fixed-point operation module is preferably given five execution cycles (a five-stage pipeline structure). Taking an input instruction set bit width of 128 bits as an example, the structure of each fixed-point operation module is described with reference to the five execution cycles of the first fixed-point operation module 0 in fig. 6. In the first fixed-point operation module 0, the first fixed-point multiplication unit of the first-stage pipeline structure comprises a 16-bit multiplier; the second fixed-point multiplication unit of the second-stage pipeline structure comprises four 8-bit multipliers and a 16-bit multiplier; the fixed-point addition unit of the third-stage pipeline structure comprises four 8-bit adders and a 16-bit adder; the fourth-stage pipeline structure comprises an output selection unit; and the result output unit of the fifth-stage pipeline structure outputs the fixed-point operation result according to the selection result.

Specifically, the first-stage pipeline structure of the fixed-point operation module splits the input 32-bit operand into four 8-bit operands and two 16-bit operands and directly starts the multiplication for halfword-type instructions. To make the calculation time required by each pipeline stage similar, this embodiment splits the 16-bit multiplication into two parts, placed in the first-stage and second-stage pipelines respectively; byte-type fixed-point numbers are not calculated in the first-stage pipeline. The second-stage pipeline uses four groups of 8-bit multipliers and one group of 16-bit multipliers, so that 8-bit and 16-bit multiplications are calculated simultaneously. The third-stage pipeline uses four groups of 8-bit adders and one 16-bit adder, so that 8-bit and 16-bit additions are calculated simultaneously. The fourth-stage pipeline acquires the floating-point path result from the multiplexed floating-point operation module and, according to the valid signal generated by the data splitting module, selects and outputs the 8-bit or 16-bit instruction calculation result obtained by the fixed-point operation module. The fifth-stage pipeline selects again between the floating-point path result calculated by the floating-point operation module and the fixed-point path result calculated by the fixed-point operation module, and outputs the finally selected result.
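A behavioural sketch may help to picture the byte lanes and the output selection of one fixed-point operation module. Several points here are assumptions that the text does not specify: each lane is taken to compute `a * b + c` on unsigned values and keep only the low bits of the lane width, the halfword handled by the floating-point path is placed in the upper 16 bits of the 32-bit module output, and the function names are hypothetical.

```python
def byte_multiply_add(a32, b32, c32):
    """Stages 2 and 3: four 8-bit multiplies, then four 8-bit adds."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (a32 >> shift) & 0xFF
        b = (b32 >> shift) & 0xFF
        c = (c32 >> shift) & 0xFF
        result |= ((a * b + c) & 0xFF) << shift  # keep the low 8 bits (assumption)
    return result


def select_output(kind, fixed_path, floating_path):
    """Stages 4 and 5: choose the 32-bit module output from the two paths."""
    if kind == "word":      # fully computed on the floating-point path
        return floating_path & 0xFFFFFFFF
    if kind == "halfword":  # one halfword from each path (placement assumed)
        return ((floating_path & 0xFFFF) << 16) | (fixed_path & 0xFFFF)
    if kind == "byte":      # fully computed in the fixed-point module
        return fixed_path & 0xFFFFFFFF
    raise ValueError(f"unknown fixed-point type: {kind}")


fixed = byte_multiply_add(0x01020304, 0x02020202, 0x01010101)
print(hex(select_output("byte", fixed, 0)))
```

The selection function plays the role of the valid signal generated by the data splitting module: the instruction type alone decides which path supplies each part of the module output.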

Compared with a traditional fixed-point operation module, this embodiment inputs fixed-point word-type and halfword-type instructions into the floating-point operation module, multiplexing its multiplier and adder. This reduces the number of 32-bit adders and multipliers and of 16-bit multipliers and adders, which effectively reduces the circuit area. In addition, the execution cycle of the fixed-point operation module is identical to that of the floating-point operation module, which simplifies the control logic and improves the overall performance and efficiency of the system.
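The reason a wider datapath can stand in for a 32-bit one can be checked numerically. The sketch below assumes, hypothetically, that a word-type multiply-add keeps the low 32 bits of `a * b + c`; the text does not define the instruction semantics, and the function names are illustrative. Because filling the upper 21 bits with zeros or sign bits does not change an operand modulo 2^32, the low 32 bits of the wide result match the direct 32-bit computation.

```python
MASK32 = (1 << 32) - 1


def extend_word_to_53(value, signed):
    """Extend a 32-bit word to 53 bits with 21 sign bits or 21 zeros."""
    value &= MASK32
    if signed and value >> 31:
        value |= ((1 << 21) - 1) << 32
    return value


def madd32_direct(a, b, c):
    """Reference: multiply-add on 32-bit words, keeping the low 32 bits."""
    return ((a & MASK32) * (b & MASK32) + (c & MASK32)) & MASK32


def madd32_via_53bit(a, b, c, signed):
    """Same operation on 53-bit extended operands, as on a shared datapath."""
    wide = (extend_word_to_53(a, signed) * extend_word_to_53(b, signed)
            + extend_word_to_53(c, signed))
    return wide & MASK32


samples = [(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF), (0x80000000, 3, 7), (12345, 67890, 42)]
for a, b, c in samples:
    for signed in (False, True):
        assert madd32_via_53bit(a, b, c, signed) == madd32_direct(a, b, c)
```

An instruction that returns the upper part of the product would depend on whether sign or zero extension was applied, which is one reason the splitting rule distinguishes signed from unsigned instructions.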

It should be noted that the fixed-point operation module in this embodiment includes four 8-bit Booth multipliers and adders. This embodiment sets fixed-point word-type instructions to multiplex the floating-point operation module completely; the multiply-add result is returned to the fixed-point operation module for output. A fixed-point halfword-type instruction requires eight multipliers and adders in total, of which only four are available by multiplexing the floating-point operation modules, so the remaining four multipliers and adders must be provided in the fixed-point operation modules; that is, half of the data of a fixed-point halfword-type instruction multiplexes the floating-point operation modules and the other half is computed in the fixed-point operation modules. For fixed-point byte-type instructions, the operations are 8 bits wide, so the multiplier and adder resources required are small and multiplexing would save few resources, while the multiplexing control logic would add considerable complexity. This embodiment therefore preferably does not multiplex the fixed-point byte type, and byte-type instructions are implemented entirely in the fixed-point operation modules; depending on the specific implementation, however, a person skilled in the art may also apply resource multiplexing to fixed-point byte-type instructions.
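The lane arithmetic behind this allocation can be written out explicitly. The sketch below is a simple count under the configuration described in this embodiment (128-bit vectors, four groups of operation modules); the function name and the dictionary keys are hypothetical.

```python
def lane_allocation(element_bits, groups=4, vector_bits=128):
    """Count how many multiply-add lanes each path handles per instruction."""
    lanes = vector_bits // element_bits
    if element_bits == 32:    # word type: fully multiplexed
        floating = lanes
    elif element_bits == 16:  # halfword type: one lane per floating-point module
        floating = groups
    elif element_bits == 8:   # byte type: not multiplexed
        floating = 0
    else:
        raise ValueError("element width must be 8, 16 or 32 bits")
    fixed = lanes - floating
    return {
        "lanes": lanes,
        "floating_path": floating,
        "fixed_path": fixed,
        "fixed_per_module": fixed // groups,
    }


for width in (32, 16, 8):
    print(width, lane_allocation(width))
```

The per-module counts agree with the resources listed for each fixed-point operation module: one 16-bit lane for halfword instructions and four 8-bit lanes for byte instructions.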

As shown in figs. 8 and 9, a conventional processor provides a 32-bit multiplier and a 32-bit adder in the fixed-point operation module and must additionally provide a 16-bit multiplier, which increases the computing resources of the fixed-point operation module; it also uses four execution cycles. If fixed-point instructions were processed as in the conventional processor, their control logic and execution cycles would differ from those of floating-point instructions, and the computing resources required would be relatively large. This embodiment instead multiplexes the multiplier and adder in the floating-point operation module to perform both fixed-point and floating-point instruction operations, omitting the 32-bit adder and multiplier and reducing the number of adders and multipliers, which effectively reduces the area of the processor. The fixed-point operation module is also designed with the same execution cycles as the floating-point operation module; for example, to match the five execution cycles of the floating-point operation module, the fixed-point operation module is divided into five cycles, so that fixed point and floating point share unified pipeline control logic and the control logic is not complicated by differing execution cycles.

By way of brief explanation, the whole process of an instruction comprises five steps: instruction fetch -> decode -> execute -> memory access -> write back. In a complex processor, however, completing these five steps takes more than ten or even tens of clock cycles, and the steps further involve related logic such as instruction dispatch, instruction dependency detection, data bypass and write back. Reducing the execution cycle of an instruction by one would therefore require correspondingly redesigned control logic. For this reason the execution cycles of the fixed-point operation module and the floating-point operation module are unified and unified pipeline control logic is used; although this costs some performance because fixed-point instructions take one more execution cycle, the control logic is simplified. Meanwhile, the device provided by this embodiment can perform the operations of all types of SIMD multiply-add instructions, which reduces design complexity and improves calculation accuracy.

Specifically, after the floating-point operation modules and the fixed-point operation modules have completed their calculations, a total of 4 groups of floating-point operation results and 4 groups of fixed-point operation results are output. The data integration module selects the 128-bit output from these 8 calculation results according to the valid signal generated by the data splitting module. If the valid signal indicates a fixed-point instruction, all 32 bits of each fixed-point operation module are selected: the result of fixed-point operation module 0 is placed in bits 0 to 31 of the calculation result, the result of fixed-point operation module 1 in bits 32 to 63, the result of fixed-point operation module 2 in bits 64 to 95, and the result of fixed-point operation module 3 in bits 96 to 127.

The calculation result of each floating-point operation module is 64 bits. For a double-precision floating-point instruction, this embodiment preferably selects the results of floating-point operation module 0 and floating-point operation module 2 and places them in the upper 64 bits and the lower 64 bits of the calculation result respectively. For a single-precision floating-point instruction, only the high 32 bits of each module's 64-bit result are output: the result of floating-point operation module 0 is placed in bits 0 to 31 of the calculation result, the result of floating-point operation module 1 in bits 32 to 63, the result of floating-point operation module 2 in bits 64 to 95, and the result of floating-point operation module 3 in bits 96 to 127.
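The selection rules of the data integration module can be summarized in a short sketch. It is illustrative only: the function name and the instruction-kind strings are hypothetical, and bit 0 is assumed to be the most significant bit of the 128-bit result, a convention the text does not state but under which module 0 occupies both "bits 0 to 31" and the "upper 64 bits".

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def integrate(kind, fixed_results, floating_results):
    """Combine per-module results into one 128-bit value (bit 0 = MSB)."""
    if kind == "fixed":     # all 32 bits of each fixed-point module
        parts, width = [r & MASK32 for r in fixed_results], 32
    elif kind == "single":  # high 32 bits of each 64-bit floating-point result
        parts, width = [(r >> 32) & MASK32 for r in floating_results], 32
    elif kind == "double":  # floating-point modules 0 and 2 only
        parts, width = [floating_results[0] & MASK64,
                        floating_results[2] & MASK64], 64
    else:
        raise ValueError(f"unknown instruction kind: {kind}")
    result = 0
    for part in parts:      # the first part ends up in the lowest-numbered bits
        result = (result << width) | part
    return result


print(hex(integrate("double", [0] * 4, [0x0123456789ABCDEF, 0, 0xFEDCBA9876543210, 0])))
```

Only one of the two result sets is consumed for any given instruction, which is why a single valid signal from the data splitting module is sufficient to steer the selection.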

The fixed-floating point SIMD multiply-add instruction fusion processing device provided by this embodiment comprises four groups of operation modules, which together satisfy all the operation bit widths and operand counts required by SIMD instructions. An input 128-bit operand passes through the data splitting module, where it is split into data of different bit widths that are supplied to the corresponding operation modules; after calculation is completed, the calculation results of the groups are combined, according to the instruction, into 128-bit data as the final calculation result. It should be noted that the device provided by the embodiment of the invention can also perform the calculation of scalar instructions. Scalar instructions use fixed-point operation module 0 and floating-point operation module 0 by default; vector double-precision floating-point instructions use fixed-point operation module 2 and floating-point operation module 2 by default; the remaining SIMD instructions use all of the computing resources.

The embodiment of the invention provides a fixed-floating point SIMD multiply-add instruction fusion processing device comprising a data splitting module, a data integration module and a plurality of parallel operation modules, wherein each group of operation modules comprises a floating-point operation module and a fixed-point operation module connected in one-to-one correspondence. The data splitting module is used for splitting an input fixed-floating point operand into a plurality of split data according to the operand type and bit width, and distributing the split data to different operation modules. The floating-point operation module is used for registering the floating-point path result, output after the multiply-add operation on the fixed-point multiplexing floating-point operand, into the output selection pipeline stage of the fixed-point operation module connected to it. The fixed-point operation module is used for operating on the fixed-point operand stage by stage using a preset fixed-point multistage pipeline structure to obtain a fixed-point path result, and outputting the fixed-point operation result according to the floating-point path result and the fixed-point path result. Compared with a traditional processor, which processes fixed-point instructions and floating-point instructions separately, this embodiment multiplexes the multiplier and adder in the floating-point operation module to perform both fixed-point and floating-point instruction operations and reduces the number of adders and multipliers, which effectively reduces the area of the processor. Meanwhile, the device provided by this embodiment can perform the operations of all types of SIMD multiply-add instructions, which reduces design complexity and improves calculation accuracy.

In one embodiment, as shown in fig. 10, the embodiment of the invention provides a fixed floating point SIMD multiply-add instruction fusion processing method, which is applied to a plurality of parallel operation modules, and the method comprises the following steps:

S1, splitting an input fixed floating point operand into a plurality of split data according to the operand type and bit width, and distributing the split data to different operation modules; the split data comprises floating point operation operands with different bit widths and fixed point operands, wherein the floating point operation operands comprise floating point operands or fixed point multiplexing floating point operands;

S2, operating on the floating-point operation operands stage by stage using a preset floating-point multistage pipeline structure to obtain floating-point operation results, and registering the floating-point path result, output after the fixed-point multiplexing floating-point operands have undergone the multiply-add operation, into the fixed-point output selection pipeline stage;

S3, carrying out operation on the fixed-point operands step by utilizing a preset fixed-point multistage pipeline structure to obtain a fixed-point passage result, and outputting a fixed-point operation result according to the floating-point passage result and the fixed-point passage result;

S4, integrating the floating point operation result or the fixed point operation result according to the bit width of the instruction set to obtain a data integration result.

It should be noted that the sequence numbers of the above processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application in any way.

For the specific limitations of the fixed-floating point SIMD multiply-add instruction fusion processing method, reference may be made to the limitations of the fixed-floating point SIMD multiply-add instruction fusion processing device above, which are not repeated here. Those of ordinary skill in the art will appreciate that the various modules and steps described in connection with the disclosed embodiments of the application may be implemented in hardware, software, or a combination of both. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The embodiment of the application provides a fixed-floating point SIMD multiply-add instruction fusion processing method, which splits an input fixed-floating point operand into a plurality of split data according to the operand type and bit width and distributes the split data to different operation modules; operates on the floating-point operation operands stage by stage using a preset floating-point multistage pipeline structure to obtain floating-point operation results, and registers the floating-point path result, output after the multiply-add operation on the fixed-point multiplexing floating-point operands, into the fixed-point output selection pipeline stage; operates on the fixed-point operands stage by stage using a preset fixed-point multistage pipeline structure to obtain a fixed-point path result, and outputs a fixed-point operation result according to the floating-point path result and the fixed-point path result; and integrates the floating-point operation result or the fixed-point operation result according to the instruction set bit width to obtain a data integration result. Compared with the prior art, the method multiplexes floating-point resources, which reduces the computing resources of the fixed-point operation module, and unifies the execution cycles of the fixed-point operation module and the floating-point operation module, so that both can use unified pipeline control logic and the control logic is simplified.

FIG. 11 shows a computer device provided by an embodiment of the present invention, which includes a memory, a processor, and a transceiver connected by a bus; the processor comprises a processor body and the above fixed-floating point SIMD multiply-add instruction fusion processing device arranged in the processor body.

The memory may comprise volatile memory or nonvolatile memory, or may comprise both volatile and nonvolatile memory; the processor may be a central processing unit, a microprocessor, an application specific integrated circuit, a programmable logic device, or a combination thereof. By way of example and not limitation, the programmable logic device described above may be a complex programmable logic device, a field programmable gate array, generic array logic, or any combination thereof.

In addition, the memory may be a physically separate unit or may be integrated with the processor.

It will be appreciated by those of ordinary skill in the art that the structure shown in FIG. 11 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution may be applied; a particular computer device may include more or fewer components than those shown, may combine some of the components, or may have a different arrangement of components.

The fixed-floating point SIMD multiply-add instruction fusion processing device simplifies control logic by multiplexing floating point computing resources, reduces the number of multipliers and adders of a fixed-point computing module, improves the utilization rate of hardware, simultaneously contributes to reducing the area required by the realization of the adders and multiplier hardware, and effectively reduces the area occupied by the processor.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., SSD), etc.

Those skilled in the art will appreciate that implementing all or part of the above described embodiment methods may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed, may comprise the steps of embodiments of the methods described above.

The foregoing examples represent only a few preferred embodiments of the present application; they are described in some detail but are not to be construed as limiting the scope of the application. It should be noted that modifications and substitutions can be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and substitutions should also be considered to be within the scope of the present application. Therefore, the protection scope of this patent is subject to the protection scope of the claims.

Claims

1. A fixed-floating point SIMD multiply-add instruction fusion processing device, characterized by comprising: a data splitting module, an operation device and a data integration module which are sequentially connected; the operation device comprises a plurality of parallel operation modules, and each operation module comprises a floating-point operation module and a fixed-point operation module which are connected in one-to-one correspondence;

The floating-point operation module is used for operating on the floating-point operation operands stage by stage using a preset floating-point multistage pipeline structure to obtain a floating-point operation result, and registering the floating-point path result, output after the fixed-point multiplexing floating-point operand has undergone the multiply-add operation, into the output selection pipeline stage of the fixed-point operation module connected to the floating-point operation module; the preset floating-point multistage pipeline structure is used for performing splitting, multiplication, addition and normalization on the floating-point operands stage by stage to obtain a floating-point operation result, and performing the multiply-add operation on the fixed-point multiplexing floating-point operands to obtain a floating-point path result;

The fixed point operation module is used for operating the fixed point operand step by utilizing a preset fixed point multistage pipeline structure to obtain a fixed point passage result, and outputting a fixed point operation result according to the floating point passage result and the fixed point passage result; the pipeline structure stages of the fixed point operation module and the floating point operation module are the same; the preset fixed-point multistage pipeline structure is used for obtaining a fixed-point path result through addition operation after two-stage multiplication operation is continuously carried out on fixed-point operands;

2. The fixed-floating point SIMD multiply-add instruction fusion processing apparatus according to claim 1, wherein the data splitting module is specifically configured to:

3. The fixed-floating point SIMD multiply-add instruction fusion processing apparatus of claim 2, wherein: the resource multiplexing type at least comprises a fixed-point word type and a fixed-point halfword type, wherein all instructions of the fixed-point word type are multiplexed with a floating point operation module, and all or part of instructions of the fixed-point halfword type are multiplexed with a floating point operation module so as to convert all or part of instructions of the fixed-point word type and the fixed-point halfword type into floating point numbers for multiply-add calculation.

4. The fixed-floating point SIMD multiply-add instruction fusion processing apparatus of claim 1, wherein: the multi-stage pipeline structure of the fixed point operation module at least comprises a first fixed point multiplication unit, a second fixed point multiplication unit, a fixed point addition unit, an output selection unit and a result output unit which are arranged in different stages of pipeline structures; the fixed-point adding unit comprises a first adder and a second adder which are respectively connected with the second multiplier and the third multiplier;

5. The fixed-floating point SIMD multiply-add instruction fusion processing apparatus of claim 4, wherein: the floating point operation module comprises a data decoding unit, a floating point multiplication unit, a floating point addition unit, a data normalization unit and a rounding and status bit generation unit;

the data normalizing unit is used for normalizing floating point addition operation results;

6. The apparatus of claim 1, wherein the data integration result comprises a fixed point data integration result or a floating point data integration result, and the data integration module is specifically configured to:

7. The apparatus of claim 1, wherein the data integration result comprises a fixed point data integration result or a floating point data integration result, and the data integration module is specifically configured to:

8. A fixed-floating point SIMD multiply-add instruction fusion processing method, characterized in that the method is applied to a plurality of parallel operation modules and comprises the following steps:

operating on the floating point operands stage by stage using a preset floating point multi-stage pipeline structure to obtain a floating point operation result, and registering the floating point path result, output after the multiply-add operation on the fixed-point multiplexing floating point operands, into the fixed-point output selection pipeline stage; the preset floating point multi-stage pipeline structure is used for splitting, multiplying, adding and normalizing the floating point operands stage by stage to obtain the floating point operation result, and for performing the multiply-add operation on the fixed-point multiplexing floating point operands to obtain the floating point path result;

operating on the fixed-point operands stage by stage using a preset fixed-point multi-stage pipeline structure to obtain a fixed-point path result, and outputting a fixed-point operation result according to the floating-point path result and the fixed-point path result; the preset fixed-point multi-stage pipeline structure obtains the fixed-point path result through an addition operation after two successive stages of multiplication are performed on the fixed-point operands;
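
As an illustration only, the two paths and the output selection recited in claim 8 can be modelled behaviourally as below. This is a minimal sketch under stated assumptions, not the claimed circuit: the function names, the 16-bit unsigned operands, the byte-wise split of the multiplier across the two multiplication stages, the 32-bit result width, and the use of Python floats as a stand-in for the floating point path are all hypothetical choices made for the example.

```python
MASK32 = (1 << 32) - 1  # assumed 32-bit result width (not stated in the claims)


def fixed_point_path(a: int, b: int, c: int) -> int:
    """Hypothetical fixed-point path: two multiplication stages, then addition.

    Assumes unsigned 16-bit a and b; the byte-wise split is an illustrative
    way to spread one multiplication over two pipeline stages.
    """
    # Stage 1 (assumed): partial product with the low byte of b.
    partial_low = a * (b & 0xFF)
    # Stage 2 (assumed): partial product with the high byte of b, aligned.
    partial_high = (a * ((b >> 8) & 0xFF)) << 8
    # Addition stage: merge the partial products and add the addend c.
    return (partial_low + partial_high + c) & MASK32


def floating_point_path(a: int, b: int, c: int) -> int:
    """Stand-in for the floating point path reused by fixed-point operands.

    Python floats are used only as a model; results are exact only while
    the product stays below 2**53.
    """
    return int(float(a) * float(b) + float(c)) & MASK32


def output_select(multiplexes_float: bool, fixed_result: int, float_result: int) -> int:
    """Output selection stage: pick the path result that matches the instruction."""
    return float_result if multiplexes_float else fixed_result


def fused_multiply_add(a: int, b: int, c: int, multiplexes_float: bool) -> int:
    """Run both paths and select, mirroring the equal-depth pipelines."""
    fixed_result = fixed_point_path(a, b, c)
    float_result = floating_point_path(a, b, c)  # registered into the select stage
    return output_select(multiplexes_float, fixed_result, float_result)
```

In this model both paths are always evaluated and the selection happens last, which reflects the requirement that the floating point path result arrive at the fixed-point output selection stage; in real hardware the timing alignment would be achieved with pipeline registers rather than function calls.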

9. A processor, characterized by comprising a processor body and the fixed-floating point SIMD multiply-add instruction fusion processing apparatus according to any one of claims 1 to 7 disposed in the processor body.

10. A computer device, characterized by: the device comprises a processor and a memory, wherein the processor is connected with the memory, and comprises a processor body and the fixed-floating point SIMD multiply-add instruction fusion processing device according to any one of claims 1 to 7 arranged in the processor body.