CN115826910A - Vector fixed point ALU processing system - Google Patents

Publication number
CN115826910A
Authority
CN
China
Prior art keywords
bit
carry
bits
source data
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310070128.4A
Other languages
Chinese (zh)
Other versions
CN115826910B (en)
Inventor
李霞
周琦
贾筠
李晋
王荣丰
霍旭东
杜鹰
胡波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sunway Technology Co ltd
Original Assignee
Chengdu Sunway Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sunway Technology Co ltd
Priority to CN202310070128.4A
Publication of CN115826910A
Application granted
Publication of CN115826910B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a vector fixed-point ALU processing system comprising a decoder, an instruction transmitting subsystem, a register file and an execution subsystem. The ALU operation subunit contains multiple arithmetic modules of different bit widths that operate in parallel, which raises operation speed. This addresses the problem that an arithmetic unit generally selects the data bit width (i.e. the element bit width) before computing, so that the computation structure has multiple levels and computation is slow when data of several bit widths are involved. Fixed-point fractions are processed directly by the fixed-point averaging addition module, the fixed-point saturating addition module and the fixed-point shift module. They need not be computed as floating-point numbers, and no exponent or mantissa parts are computed, which avoids the complex structure, high computation cost and long computation period of a floating-point unit.

Description

Vector fixed-point ALU processing system
Technical Field
The invention relates to the technical field of integrated circuits, in particular to a vector fixed point ALU processing system.
Background
The semiconductor and integrated circuit industry is a core industry of the information age and a strategic, fundamental and leading industry that supports economic and social development. As information processing accelerates, integrated circuit chips are expected to deliver ever higher performance, and the way a chip processes data largely determines that performance. Large amounts of data can be processed in parallel. Data vectorization is one form of parallel processing: individual data items are gathered into a vector, and computing on two vectors of source data processes many data items at once. Vector processing is an effective way to improve batch data processing and processor performance, and its key part is the design of the arithmetic unit. The arithmetic logic units of most CPUs process two data types, integer and floating point. Fixed-point numbers include pure integers and pure fractions; pure fractions may be sent to the floating-point unit for computation, while pure integers are computed in the integer unit.
The problems of the prior art are as follows:
1. The arithmetic unit that processes data in a chip's CPU generally selects the data bit width (i.e. the element bit width) before computing, so the computation structure has multiple levels and computation is slow when data of several bit widths are involved.
2. Data is divided into fixed point and floating point, and fixed-point data falls into three cases: pure integer, pure fraction, and integer plus fraction. Most pure-fraction computation is carried out in a floating-point unit, which has a complex structure, a high computation cost and a long computation period.
3. A floating-point number is divided into a sign bit, an exponent part and a mantissa part. When the algorithm is mapped to a hardware circuit, the circuit is built from adders, and a pure fraction must pass through two additions, which increases computation time.
Disclosure of Invention
To address the above deficiencies of the prior art, the invention provides a vector fixed-point ALU processing system that solves the problems described in the Background.
To achieve this purpose, the invention adopts the following technical solution: a vector fixed-point ALU processing system comprising a decoder, an instruction transmitting subsystem, a register file and an execution subsystem;
the decoder decodes a vector instruction to obtain a decoded vector instruction; the instruction transmitting subsystem receives the decoded vector instruction, determines the execution subsystem that performs the operation according to the operation type in the decoded vector instruction, and sends the address data in the decoded vector instruction to the register file; the execution subsystem executes the data operation instruction in the decoded vector instruction; the register file stores operation results.
Further, the decoded vector instruction is 32 bits long and includes: the vector instruction type, occupying 7 bits; the address of the target register, occupying 5 bits; the funct3 opcode, occupying 3 bits; the address of the first source register, occupying 5 bits; the address of the second source register, occupying 5 bits; the mask operation enable, occupying 1 bit; and the funct6 opcode, occupying 6 bits.
Further, the execution subsystem includes a first channel unit, a second channel unit, a third channel unit and a fourth channel unit;
each channel unit includes an ALU operation subunit.
Furthermore, the execution subsystem fetches the data to be processed from the register file, divides it into 4 parts, and sends each part to a different channel unit.
Further, in each channel unit, the ALU operation subunit operates on the data to be processed in that channel unit;
the operation types include: integer arithmetic, integer logic, integer move, integer reduction, fixed-point averaging add-subtract, fixed-point saturating add-subtract, fixed-point logic, and fixed-point narrowing instructions.
Further, the ALU operation subunit includes an integer addition module, a fixed-point averaging addition module, a fixed-point saturating addition module, a fixed-point shift module, an integer logic module and an integer comparison module;
the integer addition module includes 8-bit, 16-bit, 32-bit and 64-bit adders; the fixed-point averaging addition module includes 8-bit, 16-bit, 32-bit and 64-bit adders; the fixed-point saturating addition module includes 8-bit, 16-bit, 32-bit and 64-bit adders; the fixed-point shift module includes 8-bit, 16-bit, 32-bit and 64-bit fixed-point shifts; the integer shift module includes 8-bit, 16-bit, 32-bit and 64-bit integer shifts; the integer logic module includes AND, OR and XOR; the integer comparison module takes the difference of the data input to the ALU operation subunit and judges their relative magnitude from the difference.
Further, the 8-bit adder in the integer addition module includes two selectors and three 4-bit carry look-ahead adders; the 16-bit adder in the integer addition module includes two selectors and three 8-bit carry-select adders; the 32-bit adder in the integer addition module includes two selectors and three 16-bit carry-select adders; the 64-bit adder in the integer addition module includes two selectors and three 32-bit carry-select adders.
Further, the operation expressions of the 4-bit carry look-ahead adder are as follows:
[Figure SMS_1]
[Figure SMS_2]
wherein [Figure SMS_5] is the [Figure SMS_7]-th bit output of the 4-bit carry look-ahead adder, 4 bits in total; [Figure SMS_10] is the [Figure SMS_4]-th bit of the first type of source data input to the 4-bit carry look-ahead adder, 4 bits in total; [Figure SMS_8] is the [Figure SMS_11]-th bit of the second type of source data input to the 4-bit carry look-ahead adder, 4 bits in total; [Figure SMS_12] is the output of the [Figure SMS_3]-th NOR gate, there being 3 NOR gates in total; [Figure SMS_6] is the output of the [Figure SMS_9]-th NOR gate; and [Figure SMS_13] is the exclusive-OR operation.
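The expressions themselves appear only as images in the source, so the NOR-gate formulation of the patent cannot be reproduced here. As a hedged stand-in, the following Python sketch uses the textbook generate/propagate form of a 4-bit carry look-ahead adder. The function name `cla4` and the `g`, `p` and `c` terms are assumptions made for illustration and are not taken from the patent's gate network.

```python
def cla4(a, b, cin=0):
    """Add two 4-bit values with a textbook carry look-ahead scheme.

    Returns (sum, carry_out). The generate/propagate terms are a standard
    formulation and an assumption here, not the patent's NOR-gate network.
    """
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate: both bits are 1
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate: exactly one bit is 1

    # Every carry is expressed directly in g, p and cin, so no carry ripples.
    c0 = cin
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))  # sum bit = p XOR carry-in
    return total, c4
```

The sum bit is the exclusive-OR of the propagate term and the incoming carry, which matches the role the source assigns to the exclusive-OR operation.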
Further, each multi-bit adder in the integer addition module includes a first selector, a second selector, a first multi-bit adder sub-module, a second multi-bit adder sub-module and a third multi-bit adder sub-module. The multi-bit adder has N bits and each multi-bit adder sub-module has N/2 bits, where N is one of 8, 16, 32 and 64. A multi-bit adder sub-module is either a carry look-ahead adder or a carry-select adder;
the data input to the N-bit adder in the integer addition module includes the first type of source data and the second type of source data, each of N bits. On entering the N-bit adder, each is split into a high N/2 bits and a low N/2 bits, giving the low N/2 bits of the first type of source data, the high N/2 bits of the first type of source data, the low N/2 bits of the second type of source data and the high N/2 bits of the second type of source data;
the inputs of the first N/2-bit adder sub-module are the low N/2 bits of the first type of source data, the low N/2 bits of the second type of source data and the input carry bit, and its outputs are the first output carry and the output low N/2-bit sum; the inputs of the second N/2-bit adder sub-module are the high N/2 bits of the first type of source data, the high N/2 bits of the second type of source data and the low carry bit, and its outputs are the second output carry and the first output high N/2-bit sum; the inputs of the third N/2-bit adder sub-module are the high N/2 bits of the first type of source data, the high N/2 bits of the second type of source data and the high carry bit, and its outputs are the third output carry and the second output high N/2-bit sum; the inputs of the first selector are the first output carry, the second output carry and the third output carry, and its output is the selected output carry; the inputs of the second selector are the first output carry, the first output high N/2-bit sum and the second output high N/2-bit sum, and its output is the selected high N/2-bit sum; the selected high N/2-bit sum and the output low N/2-bit sum are concatenated and combined with the selected output carry to give the result of the N-bit adder.
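The structure above can be sketched in Python as a recursive carry-select adder. This is a minimal illustration under stated assumptions: the "low carry bit" and "high carry bit" are read as constant carry-ins of 0 and 1, both selectors are steered by the first output carry, and `add4` is a plain-arithmetic stand-in for the 4-bit carry look-ahead adder. The names `add4` and `select_add` are hypothetical.

```python
def add4(a, b, cin):
    """Stand-in for the 4-bit carry look-ahead adder: returns (sum, carry)."""
    total = a + b + cin
    return total & 0xF, total >> 4


def select_add(a, b, cin, n):
    """N-bit addition from three N/2-bit adders and two selectors."""
    if n == 4:
        return add4(a, b, cin)
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    lo_sum, c1 = select_add(a_lo, b_lo, cin, half)  # first sub-module
    hi_sum0, c2 = select_add(a_hi, b_hi, 0, half)   # second: carry-in assumed 0
    hi_sum1, c3 = select_add(a_hi, b_hi, 1, half)   # third: carry-in assumed 1

    # Both selectors pick the precomputed high result that matches c1.
    carry_out = c3 if c1 else c2       # first selector
    hi_sum = hi_sum1 if c1 else hi_sum0  # second selector
    return (hi_sum << half) | lo_sum, carry_out
```

Because both high-half results are computed before the low-half carry is known, the high half does not wait for the low half; the cost is a third sub-adder per level.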
Further, the address data includes: the address of the source register and the address of the destination register.
The technical scheme of the embodiment of the invention at least has the following advantages and beneficial effects:
1. The ALU operation subunit contains multiple arithmetic modules of different bit widths that operate in parallel, which raises operation speed. This addresses the problem that an arithmetic unit generally selects the data bit width (i.e. the element bit width) before computing, so that the computation structure has multiple levels and computation is slow when data of several bit widths are involved.
2. Fixed-point fractions are processed directly by the fixed-point averaging addition module, the fixed-point saturating addition module and the fixed-point shift module. They need not be computed as floating-point numbers, and no exponent or mantissa parts are computed, which avoids the complex structure, high computation cost and long computation period of a floating-point unit.
Drawings
FIG. 1 is a system block diagram of a vector fixed-point ALU processing system;
FIG. 2 is a block diagram of a decode vector instruction;
FIG. 3 is a diagram of multi-channel data parallel allocation;
FIG. 4 is a schematic diagram of a 64-bit ALU operator unit;
FIG. 5 is a schematic diagram of a 4-bit carry look-ahead adder;
FIG. 6 is a schematic diagram of an 8-bit adder;
FIG. 7 is a schematic diagram of a 16-bit adder;
FIG. 8 is a schematic diagram of a 32-bit adder;
FIG. 9 is a schematic diagram of a 64-bit adder.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
As shown in FIG. 1, a vector fixed-point ALU processing system comprises a decoder, an instruction transmitting subsystem, a register file and an execution subsystem;
the decoder decodes a vector instruction to obtain a decoded vector instruction; the instruction transmitting subsystem receives the decoded vector instruction, determines the execution subsystem that performs the operation according to the operation type in the decoded vector instruction, and sends the address data in the decoded vector instruction to the register file; the execution subsystem executes the data operation instruction in the decoded vector instruction; the register file stores operation results.
The address data includes the address of the source register and the address of the destination register.
The decoded vector instruction is 32 bits long and includes: the vector instruction type, occupying 7 bits; the address of the target register, occupying 5 bits; the funct3 opcode, occupying 3 bits; the address of the first source register, occupying 5 bits; the address of the second source register, occupying 5 bits; the mask operation enable, occupying 1 bit; and the funct6 opcode, occupying 6 bits.
FIG. 2 shows a 3-frame decoded vector instruction, which is 32 bits long: (I) represents the vector instruction type and occupies 7 bits; (II) represents the address of the target register and occupies 5 bits; (III) represents the funct3 opcode and occupies 3 bits; (IV) represents the address of the first source register and occupies 5 bits; (V) represents the address of the second source register and occupies 5 bits; (VI) represents the mask operation enable and occupies 1 bit; and (VII) represents the funct6 opcode and occupies 6 bits.
The specific function of each part is as follows:
(I) Instruction type (7 bits): 1010111 denotes a compute instruction and is used to select the execution subsystem that executes the vector instruction; in this ALU processing system it is constantly 1010111.
(II) vd, the target register address (5 bits): an address in the vector register file of this system; the output data of the ALU operation subunit is stored to the target register address.
(III) funct3 opcode (3 bits): funct3 has 3 cases in this system. funct3=000 indicates that the execution subsystem computes on a vector and a vector; funct3=100 indicates that it computes on a vector and a scalar; funct3=011 indicates that it computes on a vector and an immediate.
(IV) vs1/rs1/imm[4:0] (5 bits): when funct3=000, vs1 is the address of the first source register, from which the operand input to the ALU operation subunit, i.e. the first type of source data, is read; when funct3=100, rs1 indicates that the operand is scalar data; when funct3=011, the operand is an immediate.
(V) vs2, the address of the second source register (5 bits): the operand input to the ALU operation subunit, i.e. the second type of source data, is read from this address, which in this system is a vector register file address.
(VI) vm, the mask operation bit (1 bit): indicates whether a mask operation is used; vm=0 means the mask is used and vm=1 means it is not.
(VII) funct6 opcode (6 bits): instructs the ALU operation subunit to select an arithmetic or a logical operation. An arithmetic operation is an operation between two numbers in which the result of the lower bits affects the higher bits (carry). A logical operation is bitwise, and the result of the lower bits does not affect the higher bits (no carry). Data is sent to the execution subsystem according to the funct6 opcode, and different funct6 opcodes select different execution subsystems for the operation.
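The field list above can be illustrated with a small Python decoder. This is a sketch that assumes fields (I) to (VII) are packed from the least significant bit upward in the order listed, which the text does not state explicitly since FIG. 2 is not reproduced here. The function name `decode_vector_instruction` and the dictionary keys are hypothetical.

```python
def decode_vector_instruction(word):
    """Split a 32-bit instruction word into the seven fields listed above.

    Assumes fields (I)..(VII) are packed LSB-first in the listed order.
    """
    return {
        "opcode": word & 0x7F,          # (I)   7 bits, 1010111 in this system
        "vd":     (word >> 7) & 0x1F,   # (II)  5 bits, target register address
        "funct3": (word >> 12) & 0x7,   # (III) 3 bits
        "vs1":    (word >> 15) & 0x1F,  # (IV)  5 bits, vs1 / rs1 / imm[4:0]
        "vs2":    (word >> 20) & 0x1F,  # (V)   5 bits, second source register
        "vm":     (word >> 25) & 0x1,   # (VI)  1 bit, 0 = mask used, 1 = not used
        "funct6": (word >> 26) & 0x3F,  # (VII) 6 bits
    }


# Operand kinds for the three funct3 cases named in the text.
OPERAND_KIND = {
    0b000: "vector-vector",
    0b100: "vector-scalar",
    0b011: "vector-immediate",
}
```

The field widths sum to 7 + 5 + 3 + 5 + 5 + 1 + 6 = 32 bits, so the seven fields cover the whole instruction word.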
The execution subsystem includes a first channel unit, a second channel unit, a third channel unit and a fourth channel unit;
each channel unit includes an ALU operation subunit.
In each channel unit, the ALU operation subunit operates on the data to be processed in that channel unit;
the operation types include: integer arithmetic, integer logic, integer move, integer reduction, fixed-point averaging add-subtract, fixed-point saturating add-subtract, fixed-point logic, and fixed-point narrowing instructions.
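The text names fixed-point averaging and saturating additions without giving their arithmetic. The following Python sketch shows one common reading, assuming signed two's-complement elements of n bits and an average that rounds toward negative infinity. Both choices, and the names `to_signed`, `sat_add` and `avg_add`, are assumptions for illustration only.

```python
def to_signed(x, n):
    """Interpret the low n bits of x as a two's-complement value."""
    x &= (1 << n) - 1
    return x - (1 << n) if x >> (n - 1) else x


def sat_add(a, b, n):
    """Signed saturating add: clamp to the n-bit range instead of wrapping."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, to_signed(a, n) + to_signed(b, n)))


def avg_add(a, b, n):
    """Signed averaging add: (a + b) / 2, rounded toward negative infinity."""
    return (to_signed(a, n) + to_signed(b, n)) >> 1
```

Both operations use only integer addition, comparison and a shift, which is consistent with the stated aim of handling fixed-point values without a floating-point unit.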
The execution subsystem fetches the data to be processed from the register file, divides it into 4 parts, and sends each part to a different channel unit.
Data from memory must be shuffled, i.e. interleaved, when it is stored into the register file. The data in the two source registers is divided evenly into 4 parts that enter different channels, and each channel carries two data paths, the first type of source data and the second type of source data. The interleaved data distribution is shown in FIG. 3. One channel computes 64 bits, i.e. 8 bytes, of data at a time, and the 4 channels together compute 256 bits, i.e. 32 bytes, of data in parallel, which speeds up operations on the source data in the vector registers. The computed result is stored in the target register following the same interleaving rule.
As shown in FIG. 3, the value of SEW (the element bit width) is determined by the vsew[2:0] field in the privilege register vtype. The width of the data stored in the source register is determined by SEW, and the data in the source register is divided evenly into pieces of SEW bits each, one SEW-bit piece being one element. Whatever the SEW, one channel computes 64 bits at a time: if SEW=64, the 4 channels simultaneously compute 4 64-bit data; if SEW=32, the 4 channels simultaneously compute 8 32-bit data; if SEW=16, the 4 channels simultaneously compute 16 16-bit data; if SEW=8, the 4 channels simultaneously compute 32 8-bit data.
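The element counts above follow from dividing the 64-bit channel width by SEW. A minimal Python sketch is given below; the decoding `SEW = 8 << vsew` is an assumption borrowed from the RISC-V vector extension, since the text does not spell out the vsew[2:0] encoding, and the names `sew_from_vsew` and `elements_per_pass` are hypothetical.

```python
LANES = 4        # channel units in the execution subsystem
LANE_BITS = 64   # bits one channel computes at a time


def sew_from_vsew(vsew):
    """Decode vsew[2:0] to SEW, assuming the RISC-V vector encoding 8 << vsew."""
    if vsew not in (0, 1, 2, 3):
        raise ValueError("unsupported vsew encoding")
    return 8 << vsew


def elements_per_pass(sew):
    """Return (elements per channel, elements across all 4 channels)."""
    if sew not in (8, 16, 32, 64):
        raise ValueError("SEW must be 8, 16, 32 or 64")
    per_lane = LANE_BITS // sew
    return per_lane, per_lane * LANES
```

For example, `elements_per_pass(16)` gives 4 elements per channel and 16 elements across the 4 channels.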
If the vector data exceeds 256 bits, the computation takes multiple cycles: number of cycles = total number of data bits / number of bits computed by the 4 channels, rounded up. For example, 1024-bit vector data needs 1024/256 = 4 cycles. In the first cycle, the 4 channels simultaneously compute the 256 bits of data [255:0]; in the second cycle, the 256 bits of data [511:256]; in the third cycle, the 256 bits of data [767:512]; in the fourth cycle, the 256 bits of data [1023:768].
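The cycle count is a ceiling division, which can be sketched in Python as follows. The sketch assumes each cycle handles the next consecutive 256-bit slice of the vector, starting from bit 0; the function name `pass_ranges` is hypothetical.

```python
def pass_ranges(total_bits, bits_per_pass=256):
    """List the (high bit, low bit) range handled in each cycle."""
    cycles = -(-total_bits // bits_per_pass)  # ceiling division
    ranges = []
    for k in range(cycles):
        low = k * bits_per_pass
        high = min(total_bits, low + bits_per_pass) - 1  # last slice may be short
        ranges.append((high, low))
    return ranges
```

The length of the returned list is the number of cycles, so a 1024-bit vector yields four entries.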
When SEW=64, one element occupies eight bytes and represents a single vector element. The source data in a source register is divided evenly into elements of eight bytes each, which are stored in the register file in order in the interleaved manner; the first, second, third and fourth channel units then take their corresponding values from the register file for computation. Element k of the first type of source data and element k of the second type of source data occupy the same byte positions in the first source register and the second source register respectively: element 0 at bytes 0x00-0x07; element 1 at bytes 0x08-0x0F; element 2 at bytes 0x10-0x17; element 3 at bytes 0x18-0x1F.
When SEW=32, one element occupies four bytes. The source data in a source register is divided evenly into elements of four bytes each, which are stored in the register file in order in the interleaved manner; the first, second, third and fourth channel units then take their corresponding values from the register file for computation. Element k of the first type of source data and element k of the second type of source data occupy the same byte positions in the first source register and the second source register respectively: element 0 at bytes 0x00-0x03; element 1 at bytes 0x08-0x0B; element 2 at bytes 0x10-0x13; element 3 at bytes 0x18-0x1B; element 4 at bytes 0x04-0x07; element 5 at bytes 0x0C-0x0F; element 6 at bytes 0x14-0x17; element 7 at bytes 0x1C-0x1F.
When SEW=16, one element occupies two bytes. The source data in a source register is divided evenly into elements of two bytes each, which are stored in the register file in order in the interleaved manner; the first, second, third and fourth channel units then take their corresponding values from the register file for computation. Element k of the first type of source data and element k of the second type of source data occupy the same byte positions in the first source register and the second source register respectively: element 0 at bytes 0x00-0x01; element 1 at bytes 0x08-0x09; element 2 at bytes 0x10-0x11; element 3 at bytes 0x18-0x19; element 4 at bytes 0x04-0x05; element 5 at bytes 0x0C-0x0D; element 6 at bytes 0x14-0x15; element 7 at bytes 0x1C-0x1D; element 8 at bytes 0x02-0x03; element 9 at bytes 0x0A-0x0B; element 10 (0x0A) at bytes 0x12-0x13; element 11 (0x0B) at bytes 0x1A-0x1B; element 12 (0x0C) at bytes 0x06-0x07; element 13 (0x0D) at bytes 0x0E-0x0F; element 14 (0x0E) at bytes 0x16-0x17; element 15 (0x0F) at bytes 0x1E-0x1F.
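The byte positions listed above for SEW = 64, 32 and 16 are consistent with a simple rule: elements are dealt round-robin to the 4 channels, and within an 8-byte channel the slot index is bit-reversed. The Python sketch below encodes that rule. The rule is inferred from the listed positions and is an assumption, not something the text states; the names `bit_reverse` and `element_byte_range` are hypothetical.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of value."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def element_byte_range(index, sew, lanes=4, lane_bytes=8):
    """Return (first byte, last byte) of element `index` in a source register.

    Inferred rule: channel = index mod 4, and the slot inside the channel
    is the bit-reversed value of index div 4.
    """
    elem_bytes = sew // 8
    per_lane = lane_bytes // elem_bytes   # elements held by one channel
    if not 0 <= index < lanes * per_lane:
        raise ValueError("element index outside one 256-bit register")
    lane = index % lanes                  # round-robin across channels
    slot = index // lanes                 # position within that channel
    width = per_lane.bit_length() - 1     # log2(per_lane)
    start = lane * lane_bytes + bit_reverse(slot, width) * elem_bytes
    return start, start + elem_bytes - 1
```

With this rule, consecutive elements always land in different channels, so the 4 channels can work on elements k, k+1, k+2 and k+3 at the same time.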
When SEW = 8, one element occupies one byte. The source data in a source register is divided equally into 1-byte elements, which are stored in the register file in interleaved order, and the first, second, third and fourth channel units each read their corresponding register-file values to perform the data calculation. Each element of the first type of source data is stored in the first source register, and the element with the same index of the second type of source data is stored at the same byte position of the second source register. The byte positions are: the 0th element at 0X00; the 1st at 0X08; the 2nd at 0X10; the 3rd at 0X18; the 4th at 0X04; the 5th at 0X0C; the 6th at 0X14; the 7th at 0X1C; the 8th at 0X02; the 9th at 0X0A; the 10th (0X0A) at 0X12; the 11th (0X0B) at 0X1A; the 12th (0X0C) at 0X06; the 13th (0X0D) at 0X0E; the 14th (0X0E) at 0X16; the 15th (0X0F) at 0X1E; the 16th (0X10) at 0X01; the 17th (0X11) at 0X09; the 18th (0X12) at 0X11; the 19th (0X13) at 0X19; the 20th (0X14) at 0X05; the 21st (0X15) at 0X0D; the 22nd (0X16) at 0X15; the 23rd (0X17) at 0X1D; the 24th (0X18) at 0X03; the 25th (0X19) at 0X0B; the 26th (0X1A) at 0X13; the 27th (0X1B) at 0X1B; the 28th (0X1C) at 0X07; the 29th (0X1D) at 0X0F; the 30th (0X1E) at 0X17; and the 31st (0X1F) at 0X1F.
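The byte positions listed for SEW = 8 follow a regular pattern, which the Python sketch below reproduces. It rests on an observation inferred from the listed positions rather than stated in the text: element i goes to channel i mod 4 (8 bytes per channel), and its slot inside the channel is the bit-reversed value of i div 4. The function names `bit_reverse` and `byte_position` are hypothetical. The sketch matches the SEW = 8 positions and the positions of elements 14 and 15 for SEW = 16; applying it to other widths is an extrapolation.

```python
def bit_reverse(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def byte_position(element: int, sew: int) -> int:
    """First byte position of `element` in a 256-bit source register.

    Assumption (inferred from the listed positions, not stated in the text):
    4 channels of 8 bytes each; the channel is element % 4 and the slot
    inside the channel is the bit-reversed value of element // 4.
    """
    elem_bytes = sew // 8
    slots_per_channel = 8 // elem_bytes
    slot_bits = slots_per_channel.bit_length() - 1
    channel = element % 4
    slot = bit_reverse(element // 4, slot_bits)
    return channel * 8 + slot * elem_bytes


if __name__ == "__main__":
    # Print the SEW = 8 layout as "element -> byte position".
    for i in range(32):
        print(f"element {i:2d} -> 0X{byte_position(i, 8):02X}")
```

With this layout, consecutive elements land in different channels, so the four channel units can work on neighbouring elements at the same time.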
The ALU operation subunit comprises: an integer addition module, a fixed-point average addition module, a fixed-point saturation addition module, a fixed-point shift module, an integer shift module, an integer logic module and an integer comparison module;
the integer addition module comprises: 8-bit, 16-bit, 32-bit and 64-bit adders; the fixed-point average addition module comprises: 8-bit, 16-bit, 32-bit and 64-bit adders; the fixed-point saturation addition module comprises: 8-bit, 16-bit, 32-bit and 64-bit adders; the fixed-point shift module comprises: 8-bit, 16-bit, 32-bit and 64-bit fixed-point shifts; the integer shift module comprises: 8-bit, 16-bit, 32-bit and 64-bit integer shifts; the integer logic module comprises: AND, OR and XOR; the integer comparison module takes the difference of the data input to the ALU operation subunit and determines their relative magnitude from the difference.
FIG. 4 shows a 64-bit ALU operation subunit, in which data 1 is the first type of source data and data 2 is the second type of source data. Data 1 and data 2 are the data input from the register file that take part in the calculation. Data 1 can be of vector, scalar or immediate type, while data 2 can only be of vector type; if the data is of scalar or immediate type, it is expanded to the corresponding width according to SEW, so that the data of each channel is finally 64 bits wide. The ALU operation subunit controls the input data according to the executed instruction and selects one of the 7 modules to calculate the correct result. The instructions implemented by the integer addition module, the fixed-point saturation addition module, the fixed-point average addition module, the fixed-point shift module and the integer shift module are shown in Table 1. The integer logic module contains 3 logic operations, and the instructions it implements are shown in Table 1. The integer comparison module checks whether the two source data have the same sign or different signs: if the signs differ, it determines the magnitude relation directly; if the signs are the same, it calls the integer addition module on the two source data to compare their magnitudes; the instructions it implements are shown in Table 1. The 7 modules calculate simultaneously, and the correct result is finally selected for output according to the instruction.
TABLE 1
| Instruction type | Instruction names | Instruction description | Compatible widths | Execution module |
| --- | --- | --- | --- | --- |
| Integer arithmetic operation | VADD, VADC, VREDSUM, VWREDSUMU, VWREDSUM, VSUB, VSBC, VRSUB | Add, add with carry, reduction sum, widening unsigned reduction sum, widening reduction sum, subtract, subtract with borrow, reverse subtract | 8/16/32/64 | Integer addition module |
| Fixed-point saturation operation | VSADDU, VSADD, VSSUBU, VSSUB | Unsigned/signed saturating add, unsigned/signed saturating subtract | 8/16/32/64 | Fixed-point saturation addition module |
| Fixed-point averaging operation | VAADDU, VAADD, VASUBU, VASUB | Unsigned/signed averaging add, unsigned/signed averaging subtract | 8/16/32/64 | Fixed-point average addition module |
| Fixed-point shift operation | VSSRL, VSSRA, VNCLIPU, VNCLIP | Scaling logical/arithmetic right shift, unsigned/signed clip | 8/16/32/64 | Fixed-point shift module |
| Integer shift operation | VSLL, VSRL, VSRA, VNSRL, VNSRA | Logical left/right shift, arithmetic right shift, narrowing logical/arithmetic right shift | 8/16/32/64 | Integer shift module |
| Integer logic operation | VAND, VREDAND, VOR, VREDOR, VXOR, VREDXOR | AND, reduction AND, OR, reduction OR, XOR, reduction XOR | 8/16/32/64 | Integer logic module |
| Integer compare operation | VMIN, VMINU, VMAX, VMAXU, VREDMINU, VREDMIN, VREDMAXU, VREDMAX, VMSEQ, VMSNE, VMSLT, VMSLTU, VMSLE, VMSLEU, VMSGT, VMSGTU, VMSGE, VMSGEU | Signed/unsigned minimum, signed/unsigned maximum, unsigned/signed reduction minimum, unsigned/signed reduction maximum, equal, not equal, signed/unsigned less than, signed/unsigned less than or equal, signed/unsigned greater than, signed/unsigned greater than or equal | 8/16/32/64 | Integer comparison module, integer addition module |
The 8-bit adder in the integer addition module comprises: two selectors and three 4-bit carry look-ahead adders; the 16-bit adder in the integer addition module comprises: two selectors and three 8-bit carry-select adders; the 32-bit adder in the integer addition module comprises: two selectors and three 16-bit carry-select adders; the 64-bit adder in the integer addition module comprises: two selectors and three 32-bit carry-select adders.
The 4-bit adder is composed of four 1-bit full adders, which can be connected with serial carry or with parallel carry. Here the delay of a basic AND, OR or NOT gate is T, the delay of a NAND or NOR gate is 2T, and the delay of an XOR gate is 3T. If the 4 full adders form a 4-bit serial-carry adder, the critical path delay is 3T + 4(T + T) = 11T; if the 4 full adders form a 4-bit parallel-carry adder, the critical path delay is 8T. Therefore, in this embodiment the 4-bit adder uses the parallel connection, implemented as a carry look-ahead adder with the structure shown in FIG. 5: the carry input signal of each full adder is obtained in advance through a logic circuit, which increases the operation speed.
The expressions of the 4-bit carry look-ahead adder are as follows:
[Figure SMS_14] (1)
[Figure SMS_15] (2)
[Figure SMS_16] (3)
The carries are generated from (1), (2) and (3) as follows:
[Figure SMS_17] (4)
[Figure SMS_18] (5)
[Figure SMS_19] (6)
[Figure SMS_20] (7)
[Figure SMS_21] (8)
[Figure SMS_22] (9)
where [Figure SMS_23] is the [Figure SMS_28]-th output bit of the 4-bit carry look-ahead adder, 4 bits in total; [Figure SMS_30] is the [Figure SMS_25]-th bit of the first type of source data input to the 4-bit carry look-ahead adder, 4 bits in total; [Figure SMS_26] is the [Figure SMS_31]-th bit of the second type of source data input to the 4-bit carry look-ahead adder, 4 bits in total; [Figure SMS_32] is the output of the [Figure SMS_24]-th NOR gate, with 3 NOR gates in total; [Figure SMS_27] is the output of the [Figure SMS_29]-th NOR gate; and [Figure SMS_33] denotes the exclusive-OR operation.
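Equations (1) to (9) are available here only as figure references, so the following Python sketch uses the textbook generate/propagate formulation of a 4-bit carry look-ahead adder rather than the NOR-gate form of FIG. 5. The names `cla4`, `g` and `p` are illustrative and do not come from the text. The point it shows is that every carry is written directly in terms of the inputs and the carry-in, so no bit has to wait for the carry of the previous bit.

```python
def cla4(a: int, b: int, cin: int = 0) -> tuple[int, int]:
    """Add two 4-bit values with carry look-ahead; return (sum, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate: this bit creates a carry
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate: this bit passes a carry on

    # Each carry is expanded in terms of g, p and cin only (no rippling).
    c0 = cin
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i  # sum bit = propagate XOR incoming carry
    return total, c4
```

For example, `cla4(0b1111, 0b0001)` yields a sum of 0 with a carry-out of 1, the same result a ripple-carry adder would give, only with a shorter critical path in hardware.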
The multi-bit adder in the integer addition module comprises: a first selector, a second selector, a first multi-bit adder sub-module, a second multi-bit adder sub-module and a third multi-bit adder sub-module. The multi-bit adder has N bits and each multi-bit adder sub-module has N/2 bits, where N is 8, 16, 32 or 64; the multi-bit adder sub-module is a carry look-ahead adder or a carry-select adder. The data input to the N-bit adder in the integer addition module includes: the first type of source data, the second type of source data and an input carry, where the first and second types of source data are both N bits. After entering the N-bit adder, the first and second types of source data are each divided into a high N/2 bits and a low N/2 bits, giving the low N/2 bits of the first type of source data, the high N/2 bits of the first type of source data, the low N/2 bits of the second type of source data and the high N/2 bits of the second type of source data. The inputs of the first N/2-bit adder sub-module are the low N/2 bits of the first type of source data, the low N/2 bits of the second type of source data and the input carry, and its outputs are a first output carry and an output low N/2-bit sum. The inputs of the second N/2-bit adder sub-module are the high N/2 bits of the first type of source data, the high N/2 bits of the second type of source data and a low carry, and its outputs are a second output carry and a first output high N/2-bit sum. The inputs of the third N/2-bit adder sub-module are the high N/2 bits of the first type of source data, the high N/2 bits of the second type of source data and a high carry, and its outputs are a third output carry and a second output high N/2-bit sum. The inputs of the first selector are the first output carry, the second output carry and the third output carry, and its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high N/2-bit sum and the second output high N/2-bit sum, and its output is the selected output high N/2-bit sum. The selected output high N/2-bit sum and the output low N/2-bit sum are concatenated and combined with the selected output carry to give the calculation result of the N-bit adder.
The N-bit adders in the integer addition module are specifically of the following types:
As shown in FIG. 6, the 8-bit adder in the integer addition module comprises: a first selector, a second selector, a first 4-bit carry look-ahead adder, a second 4-bit carry look-ahead adder and a third 4-bit carry look-ahead adder. The data input to the 8-bit adder in the integer addition module includes: the first type of source data, the second type of source data and an input carry, where the first and second types of source data are both 8 bits. After entering the 8-bit adder, the first and second types of source data are each divided into a high 4 bits and a low 4 bits, giving the low 4 bits of the first type of source data, the high 4 bits of the first type of source data, the low 4 bits of the second type of source data and the high 4 bits of the second type of source data. The inputs of the first 4-bit carry look-ahead adder are the low 4 bits of the first type of source data, the low 4 bits of the second type of source data and the input carry, and its outputs are a first output carry and an output low 4-bit sum. The inputs of the second 4-bit carry look-ahead adder are the high 4 bits of the first type of source data, the high 4 bits of the second type of source data and a low carry, and its outputs are a second output carry and a first output high 4-bit sum. The inputs of the third 4-bit carry look-ahead adder are the high 4 bits of the first type of source data, the high 4 bits of the second type of source data and a high carry, and its outputs are a third output carry and a second output high 4-bit sum. The inputs of the first selector are the first output carry, the second output carry and the third output carry, and its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high 4-bit sum and the second output high 4-bit sum, and its output is the selected output high 4-bit sum. The selected output high 4-bit sum and the output low 4-bit sum are concatenated and combined with the selected output carry to give the calculation result of the 8-bit adder.
Specifically, in FIG. 6, the 8-bit data is divided into 2 groups, the low 4 bits [3:0] and the high 4 bits [7:4]. The carry into the low 4 bits is the known input carry Cin, which defaults to 0 when no carry is input; the carry into the high 4 bits is not known in advance and is either 0 or 1, with a default of 0. One low 4-bit addition and two high 4-bit additions are calculated simultaneously, giving one low 4-bit sum S[3:0] with carry C3 and two high 4-bit sums S[7:4] with carries C7. The high 4-bit result is then selected according to the carry C3 of the low 4-bit addition: if C3 = 0, C7 and S[7:4] of the 1st high 4-bit adder are selected; if C3 = 1, C7 and S[7:4] of the 2nd high 4-bit adder are selected. S[3:0] and S[7:4] are concatenated to form the sum; finally, the carry Cout = C7 and the sum S[7:0] are output.
As shown in FIG. 7, the 16-bit adder in the integer addition module comprises: a first selector, a second selector, a first 8-bit carry-select adder, a second 8-bit carry-select adder and a third 8-bit carry-select adder. The data input to the 16-bit adder in the integer addition module includes: the first type of source data, the second type of source data and an input carry, where the first and second types of source data are both 16 bits. After entering the 16-bit adder, the first and second types of source data are each divided into a high 8 bits and a low 8 bits, giving the low 8 bits of the first type of source data, the high 8 bits of the first type of source data, the low 8 bits of the second type of source data and the high 8 bits of the second type of source data. The inputs of the first 8-bit carry-select adder are the low 8 bits of the first type of source data, the low 8 bits of the second type of source data and the input carry, and its outputs are a first output carry and an output low 8-bit sum. The inputs of the second 8-bit carry-select adder are the high 8 bits of the first type of source data, the high 8 bits of the second type of source data and a low carry, and its outputs are a second output carry and a first output high 8-bit sum. The inputs of the third 8-bit carry-select adder are the high 8 bits of the first type of source data, the high 8 bits of the second type of source data and a high carry, and its outputs are a third output carry and a second output high 8-bit sum. The inputs of the first selector are the first output carry, the second output carry and the third output carry, and its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high 8-bit sum and the second output high 8-bit sum, and its output is the selected output high 8-bit sum. The selected output high 8-bit sum and the output low 8-bit sum are concatenated and combined with the selected output carry to give the calculation result of the 16-bit adder.
Specifically, in FIG. 7, the 16-bit data is divided into 2 groups, the low 8 bits [7:0] and the high 8 bits [15:8]. The carry into the low 8 bits is the known input carry Cin, which defaults to 0 when no carry is input; the carry into the high 8 bits is not known in advance and is either 0 or 1, with a default of 0. One low 8-bit addition and two high 8-bit additions are calculated simultaneously, giving one low 8-bit sum S[7:0] with carry C7 and two high 8-bit sums S[15:8] with carries C15. The high 8-bit result is then selected according to the carry C7 of the low 8-bit addition: if C7 = 0, C15 and S[15:8] of the 1st high 8-bit adder are selected; if C7 = 1, C15 and S[15:8] of the 2nd high 8-bit adder are selected. S[7:0] and S[15:8] are concatenated to form the sum; finally, the carry Cout = C15 and the sum S[15:0] are output.
As shown in FIG. 8, the data input to the 32-bit adder in the integer addition module includes: the first type of source data, the second type of source data and an input carry, where the first and second types of source data are both 32 bits. After entering the 32-bit adder, the first and second types of source data are each divided into a high 16 bits and a low 16 bits, giving the low 16 bits of the first type of source data, the high 16 bits of the first type of source data, the low 16 bits of the second type of source data and the high 16 bits of the second type of source data. The inputs of the first 16-bit carry-select adder are the low 16 bits of the first type of source data, the low 16 bits of the second type of source data and the input carry, and its outputs are a first output carry and an output low 16-bit sum. The inputs of the second 16-bit carry-select adder are the high 16 bits of the first type of source data, the high 16 bits of the second type of source data and a low carry, and its outputs are a second output carry and a first output high 16-bit sum. The inputs of the third 16-bit carry-select adder are the high 16 bits of the first type of source data, the high 16 bits of the second type of source data and a high carry, and its outputs are a third output carry and a second output high 16-bit sum. The inputs of the first selector are the first output carry, the second output carry and the third output carry, and its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high 16-bit sum and the second output high 16-bit sum, and its output is the selected output high 16-bit sum. The selected output high 16-bit sum and the output low 16-bit sum are concatenated and combined with the selected output carry to give the calculation result of the 32-bit adder.
Specifically, in FIG. 8, the 32-bit data is divided into 2 groups, the low 16 bits [15:0] and the high 16 bits [31:16]. The carry into the low 16 bits is the known input carry Cin, which defaults to 0 when no carry is input; the carry into the high 16 bits is not known in advance and is either 0 or 1, with a default of 0. One low 16-bit addition and two high 16-bit additions are calculated simultaneously, giving one low 16-bit sum S[15:0] with carry C15 and two high 16-bit sums S[31:16] with carries C31. The high 16-bit result is then selected according to the carry C15 of the low 16-bit addition: if C15 = 0, C31 and S[31:16] of the 1st high 16-bit adder are selected; if C15 = 1, C31 and S[31:16] of the 2nd high 16-bit adder are selected. S[15:0] and S[31:16] are concatenated to form the sum; finally, the carry Cout = C31 and the sum S[31:0] are output.
As shown in FIG. 9, the data input to the 64-bit adder in the integer addition module includes: the first type of source data, the second type of source data and an input carry, where the first and second types of source data are both 64 bits. After entering the 64-bit adder, the first and second types of source data are each divided into a high 32 bits and a low 32 bits, giving the low 32 bits of the first type of source data, the high 32 bits of the first type of source data, the low 32 bits of the second type of source data and the high 32 bits of the second type of source data. The inputs of the first 32-bit carry-select adder are the low 32 bits of the first type of source data, the low 32 bits of the second type of source data and the input carry, and its outputs are a first output carry and an output low 32-bit sum. The inputs of the second 32-bit carry-select adder are the high 32 bits of the first type of source data, the high 32 bits of the second type of source data and a low carry, and its outputs are a second output carry and a first output high 32-bit sum. The inputs of the third 32-bit carry-select adder are the high 32 bits of the first type of source data, the high 32 bits of the second type of source data and a high carry, and its outputs are a third output carry and a second output high 32-bit sum. The inputs of the first selector are the first output carry, the second output carry and the third output carry, and its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high 32-bit sum and the second output high 32-bit sum, and its output is the selected output high 32-bit sum. The selected output high 32-bit sum and the output low 32-bit sum are concatenated and combined with the selected output carry to give the calculation result of the 64-bit adder.
Specifically, in FIG. 9, the 64-bit data is divided into 2 groups, the low 32 bits [31:0] and the high 32 bits [63:32]. The carry into the low 32 bits is the known input carry Cin, which defaults to 0 when no carry is input; the carry into the high 32 bits is not known in advance and is either 0 or 1, with a default of 0. One low 32-bit addition and two high 32-bit additions are calculated simultaneously, giving one low 32-bit sum S[31:0] with carry C31 and two high 32-bit sums S[63:32] with carries C63. The high 32-bit result is then selected according to the carry C31 of the low 32-bit addition: if C31 = 0, C63 and S[63:32] of the 1st high 32-bit adder are selected; if C31 = 1, C63 and S[63:32] of the 2nd high 32-bit adder are selected. S[31:0] and S[63:32] are concatenated to form the sum; finally, the carry Cout = C63 and the sum S[63:0] are output.
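The recursive construction of the 8-, 16-, 32- and 64-bit adders can be sketched in Python as follows. This is a behavioural model under stated assumptions, not the circuit itself: `add4` is a plain arithmetic stand-in for the 4-bit carry look-ahead adder, the "low carry" and "high carry" are taken to be carry-ins of 0 and 1 as the FIG. 6 description suggests, and the function names are hypothetical. In hardware the three half-width additions run at the same time; here they simply run one after another.

```python
def add4(a: int, b: int, cin: int) -> tuple[int, int]:
    """Stand-in for the 4-bit carry look-ahead adder: returns (sum, carry_out)."""
    total = (a & 0xF) + (b & 0xF) + cin
    return total & 0xF, total >> 4


def carry_select_add(a: int, b: int, cin: int, n: int) -> tuple[int, int]:
    """N-bit carry-select addition built from three N/2-bit adders (N = 8, 16, 32, 64)."""
    if n == 4:
        return add4(a, b, cin)
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, (a >> half) & mask
    b_lo, b_hi = b & mask, (b >> half) & mask

    lo_sum, lo_carry = carry_select_add(a_lo, b_lo, cin, half)   # first adder
    hi_sum0, hi_carry0 = carry_select_add(a_hi, b_hi, 0, half)   # second adder, carry-in 0
    hi_sum1, hi_carry1 = carry_select_add(a_hi, b_hi, 1, half)   # third adder, carry-in 1

    # The two selectors keep the upper-half result that matches the real lower-half carry.
    hi_sum = hi_sum1 if lo_carry else hi_sum0
    cout = hi_carry1 if lo_carry else hi_carry0
    return (hi_sum << half) | lo_sum, cout
```

For instance, `carry_select_add(0xFF, 0x01, 0, 8)` returns `(0x00, 1)`: the lower half produces C3 = 1, so the upper-half result computed with carry-in 1 is the one selected.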
Specifically, in FIG. 4, in the fixed-point saturation addition module (e) of the 64-bit ALU operation subunit, the 8-bit adder (e8) comprises: one integer 8-bit carry-select adder (d8) and one saturation operation module; the 16-bit adder (e16) comprises: one integer 16-bit carry-select adder (d16) and one saturation operation module; the 32-bit adder (e32) comprises: one integer 32-bit carry-select adder (d32) and one saturation operation module; the 64-bit adder (e64) comprises: one integer 64-bit carry-select adder (d64) and one saturation operation module. The saturation operation module determines whether the adder result overflows: when it overflows, the saturation value is output; when it does not overflow, the adder sum is output. The output is therefore the result of the addition followed by the saturation calculation.
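The behaviour of the adder followed by the saturation operation module can be illustrated with a short Python sketch. It is a behavioural model only: the function name `saturating_add` and the convention of passing operands as already-interpreted Python integers are assumptions made for illustration, and the saturation values are taken to be the largest and smallest values representable in SEW bits.

```python
def saturating_add(a: int, b: int, sew: int, signed: bool) -> int:
    """Add two SEW-bit element values and clamp the sum to the representable range."""
    if signed:
        lowest, highest = -(1 << (sew - 1)), (1 << (sew - 1)) - 1
    else:
        lowest, highest = 0, (1 << sew) - 1
    total = a + b          # exact sum, i.e. the adder output including its carry
    if total > highest:
        return highest     # overflow: output the upper saturation value
    if total < lowest:
        return lowest      # underflow: output the lower saturation value
    return total           # no overflow: output the adder sum unchanged
```

With SEW = 8, `saturating_add(127, 1, 8, signed=True)` stays at 127 instead of wrapping to -128, and `saturating_add(250, 10, 8, signed=False)` stays at 255.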
In the fixed-point average addition module (f) of the 64-bit ALU operation subunit, the 8-bit adder (f8) comprises: one integer 8-bit carry-select adder (d8), one shifter and one rounding module; the 16-bit adder (f16) comprises: one integer 16-bit carry-select adder (d16), one shifter and one rounding module; the 32-bit adder (f32) comprises: one integer 32-bit carry-select adder (d32), one shifter and one rounding module; the 64-bit adder (f64) comprises: one integer 64-bit carry-select adder (d64), one shifter and one rounding module. The shifter computes the average of the adder output by a fixed shift of 1 bit. The rounding module determines whether the averaged sum is rounded up or rounded down according to the fixed-point rounding mode. The output is the result of the addition, the 1-bit shift and the rounding calculation.
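The add, shift-by-1 and round sequence can be sketched in Python as below. The text does not name the rounding modes; since the instruction names in Table 1 match the RISC-V vector extension, this sketch assumes that extension's four fixed-point rounding modes (`rnu`, `rne`, `rdn`, `rod`). The function name `averaging_add` is hypothetical, and operands are passed as already-interpreted signed or unsigned Python integers.

```python
def averaging_add(a: int, b: int, mode: str = "rnu") -> int:
    """Return (a + b) / 2, rounded according to `mode`.

    The full-precision sum keeps the adder's carry bit, so the average
    cannot overflow the element width.
    """
    total = a + b
    bit0 = total & 1           # the bit shifted out by the 1-bit right shift
    bit1 = (total >> 1) & 1    # least significant bit of the shifted result
    if mode == "rnu":          # round to nearest, ties up
        increment = bit0
    elif mode == "rne":        # round to nearest, ties to even
        increment = bit0 & bit1
    elif mode == "rdn":        # round down (truncate)
        increment = 0
    elif mode == "rod":        # round to odd
        increment = bit0 & (1 - bit1)
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return (total >> 1) + increment
```

For the operands 2 and 3 (exact average 2.5), `rnu` gives 3, `rne` gives 2, `rdn` gives 2 and `rod` gives 3.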
In the fixed-point shift module (g) of the 64-bit ALU operation subunit, the 8-bit fixed-point shift (g8) comprises: one 8-bit shifter and one rounding module; the 16-bit fixed-point shift (g16) comprises: one 16-bit shifter and one rounding module; the 32-bit fixed-point shift (g32) comprises: one 32-bit shifter and one rounding module; the 64-bit fixed-point shift (g64) comprises: one 64-bit shifter and one rounding module. The shifter performs the shift. The rounding module determines whether the shifted data is rounded up or rounded down according to the fixed-point rounding mode. The output is the shifted and rounded result.
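A scaling right shift with rounding can be modelled as follows. As before, this is a sketch under an assumption the text does not state: the rounding modes are taken from the RISC-V vector extension (`rnu`, `rne`, `rdn`, `rod`). The helper names are hypothetical. Passing a non-negative integer models the logical shift, and passing a negative integer models the arithmetic shift, because Python's `>>` preserves the sign.

```python
def rounding_increment(value: int, shamt: int, mode: str) -> int:
    """Increment to add after shifting `value` right by `shamt` bits."""
    if mode not in ("rnu", "rne", "rdn", "rod"):
        raise ValueError(f"unknown rounding mode: {mode}")
    if shamt == 0 or mode == "rdn":
        return 0
    guard = (value >> (shamt - 1)) & 1                 # most significant discarded bit
    sticky = (value & ((1 << (shamt - 1)) - 1)) != 0   # any lower discarded bit set
    lsb = (value >> shamt) & 1                         # lowest bit that is kept
    if mode == "rnu":                                  # round to nearest, ties up
        return guard
    if mode == "rne":                                  # round to nearest, ties to even
        return guard & int(sticky or lsb == 1)
    # "rod": force the kept LSB to 1 if anything non-zero was discarded
    return int(lsb == 0 and (guard == 1 or sticky))


def scaling_shift_right(value: int, shamt: int, mode: str = "rnu") -> int:
    """Shift right by `shamt` bits and round the result according to `mode`."""
    return (value >> shamt) + rounding_increment(value, shamt, mode)
```

Shifting 10 right by 2 bits (exact value 2.5) gives 3 under `rnu`, 2 under `rne`, 2 under `rdn` and 3 under `rod`.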
In the integer shift module (h) of the 64-bit ALU operation subunit, the 8-bit integer shift (h8) comprises one shifter for shifting; the 16-bit integer shift (h16) comprises one shifter for shifting; the 32-bit integer shift (h32) comprises one shifter for shifting; the 64-bit integer shift (h64) comprises one shifter for shifting. The integer shift module (h) shifts the data directly, and the bits shifted out are simply discarded. The output is the shifted result.
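In contrast to the fixed-point shift, the integer shift applies no rounding. A minimal Python model of the three shift kinds listed in Table 1 is shown below. The function names are hypothetical, and reducing the shift amount modulo SEW is an assumption borrowed from the RISC-V vector extension, not something the text specifies.

```python
def sll(value: int, shamt: int, sew: int) -> int:
    """Logical left shift of a SEW-bit pattern; bits shifted out are discarded."""
    mask = (1 << sew) - 1
    return (value << (shamt % sew)) & mask


def srl(value: int, shamt: int, sew: int) -> int:
    """Logical right shift of a SEW-bit pattern; zeros are shifted in."""
    mask = (1 << sew) - 1
    return (value & mask) >> (shamt % sew)


def sra(value: int, shamt: int, sew: int) -> int:
    """Arithmetic right shift of a SEW-bit pattern; the sign bit is replicated."""
    mask = (1 << sew) - 1
    value &= mask
    if value >> (sew - 1):      # negative in two's complement
        value -= 1 << sew       # reinterpret the pattern as a signed integer
    return (value >> (shamt % sew)) & mask
```

For the 8-bit pattern 0x81, a 1-bit shift gives 0x02 for `sll`, 0x40 for `srl` and 0xC0 for `sra`.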
The integer logic module (i) of the 64-bit ALU operation subunit performs all bitwise operations and contains AND, OR and XOR functional sub-modules. The output is the result of the logical calculation.
The integer comparison module (j) of the 64-bit ALU operation subunit calls the integer addition module (d), directly takes the difference of the two numbers and compares the difference with 0, thereby determining the maximum, the minimum, or the greater-than, less-than or equal relation. If an integer comparison instruction is executed, the output is a Boolean value: 1 if the comparison is true, and 0 otherwise. If an integer maximum or minimum instruction is executed, the output is the value of the corresponding source operand.
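The comparison by subtraction can be sketched in Python for the signed less-than case. This is an illustrative model with hypothetical function names: operands are SEW-bit two's-complement bit patterns, operands with different signs are decided from the sign bits alone, and operands with the same sign are subtracted using an adder (a plus the complement of b plus 1), where the difference cannot overflow.

```python
def signed_less_than(a: int, b: int, sew: int) -> int:
    """Return 1 if a < b for two SEW-bit two's-complement patterns, else 0."""
    mask = (1 << sew) - 1
    a &= mask
    b &= mask
    sign_a = a >> (sew - 1)
    sign_b = b >> (sew - 1)
    if sign_a != sign_b:
        return sign_a                       # the negative operand is the smaller one
    diff = (a + (~b & mask) + 1) & mask     # a - b computed with the adder
    return diff >> (sew - 1)                # a negative difference means a < b


def signed_min(a: int, b: int, sew: int) -> int:
    """Minimum instruction: output the source operand selected by the comparison."""
    return a if signed_less_than(a, b, sew) else b
```

The comparison instruction outputs the Boolean 0 or 1, while the minimum instruction reuses the same comparison to pick one of the two source operands, for example `signed_min(0xFF, 0x01, 8)` returns 0xFF (that is, -1).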
At the output of the 64-bit ALU operation subunit, the result of one of the 7 modules in the 64-bit ALU circuit (c) is selected according to the operation type of the instruction and is output as the calculation result of the instruction.
The ALU processing system supports two data types: pure integers (fixed-point numbers whose decimal point is at the end of the number, so that everything after the decimal point is 0; they are usually written directly as integers) and pure decimals (the units digit is 0, the decimal point follows the units digit, and the first digit after the decimal point is not 0). It implements two kinds of instructions, integer instructions and fixed-point instructions, and supports element bit widths of 8, 16, 32 and 64.
Inside the CPU, a vector instruction is issued and sent to a decoder for decoding. Each instruction has a specific encoding, which includes the data type (the element bit width of the vector source data, the operation unit, and so on) and indicates whether the instruction performs an integer or a fixed-point operation; different instructions correspond to different instruction encodings.
The ALU operator subunit divides the vector instructions into 7 general classes of operator unit computations: integer addition, fixed point saturation addition, fixed point average addition, fixed point shift, integer logic, and integer comparison calculation. And selecting an operation module in 7 modules according to the instruction, wherein the operation modules with different bits in the operation module can perform parallel calculation on the incoming data, and the operation modules can shoot a result and select different results according to the instruction.
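The "compute in parallel, then select" behaviour can be modelled as follows. This is a minimal sketch, not the patent's circuit: the module names in `MODULES` are hypothetical stand-ins for a few of the 7 modules, and a Python dictionary comprehension stands in for hardware units that all evaluate in the same beat.

```python
MASK64 = (1 << 64) - 1

# Hypothetical stand-ins for some of the 7 modules; each takes two 64-bit operands.
MODULES = {
    "int_add": lambda a, b: (a + b) & MASK64,
    "int_and": lambda a, b: a & b,
    "int_xor": lambda a, b: a ^ b,
    "int_eq": lambda a, b: int(a == b),
}


def alu_select(op: str, a: int, b: int) -> int:
    """Evaluate every module on the same inputs, then pick one result by operation type."""
    a &= MASK64
    b &= MASK64
    results = {name: fn(a, b) for name, fn in MODULES.items()}  # all modules "run"
    return results[op]  # output multiplexer driven by the decoded instruction
```

The design choice this illustrates is a trade of area for latency: every module produces a result whether or not it is needed, so the only instruction-dependent step is the final selection.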
Before data enters the register file it undergoes a shuffle operation, which supports the execution of vector instructions, increases parallelism, and makes it easier for cross-width instructions to perform cross-width operations.
The technical solution of the embodiments of the invention has at least the following advantages and beneficial effects:
1. A fixed-point pure fraction does not have to be computed as a floating-point number: no exponent part or mantissa part is computed, and the value goes directly into the pure-fraction operation unit, which produces the result in one beat.
2. Multi-channel operation is fast: data are stored in 4 channels and processed in parallel, which raises the operation speed and allows 256 bits of data to be computed at the same time.
3. All computation uses a parallel structure, so the operation unit is fast: several results are computed and one is selected according to the instruction.
4. Compatibility is good: the arithmetic unit supports operations at widths of 8, 16, 32, and 64 bits.
5. Extensibility is forward-looking: a 3-bit field of a privileged register controls the data width, and only 2 of those 3 bits are currently used to control the element bit width, so future data can be extended to widths of 128, 256, 512, and 1024.
6. A wide range of operand types is supported, including vector-vector computation, vector-scalar computation, and computation with immediates.
7. Rounding modes are diverse; 4 fixed-point rounding modes are supported: round to nearest with ties rounded up (rnu), round to nearest with ties to even (rne), truncation (rdn), and round to odd (rod).
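The four rounding modes in item 7 share their mnemonics with the fixed-point rounding modes of the RISC-V vector extension. Assuming the patent intends the same behaviour (an assumption, since the text gives only the names), a rounding right shift by `d` bits can be sketched as the truncated shift plus a one-bit increment `r` that depends on the mode. The function name is hypothetical.

```python
def round_shift_right(v: int, d: int, mode: str) -> int:
    """Shift v right by d bits, rounding according to mode (rnu, rne, rdn, rod)."""
    if d == 0:
        return v
    half = (v >> (d - 1)) & 1               # most significant discarded bit
    below_half = v & ((1 << (d - 1)) - 1)   # remaining discarded bits
    discarded = v & ((1 << d) - 1)          # all discarded bits
    lsb = (v >> d) & 1                      # lowest bit that is kept
    if mode == "rnu":      # round to nearest, ties up
        r = half
    elif mode == "rne":    # round to nearest, ties to even
        r = half & int(below_half != 0 or lsb == 1)
    elif mode == "rdn":    # truncate
        r = 0
    elif mode == "rod":    # round to odd: set the kept LSB if anything was lost
        r = int(lsb == 0 and discarded != 0)
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return (v >> d) + r
```

As a worked case, shifting 6 right by 2 bits (6/4 = 1.5) gives 2 under rnu and rne, and 1 under rdn and rod; shifting 2 right by 2 bits (0.5) gives 1 under rnu and rod, and 0 under rne and rdn.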
The above is only a preferred embodiment of the present invention and is not intended to limit it; various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (12)

1. A vector fixed-point ALU processing system, comprising: a decoder, an instruction issue subsystem, a register file, and an execution subsystem;
the decoder is configured to decode a vector instruction to obtain a decoded vector instruction; the instruction issue subsystem is configured to receive the decoded vector instruction, determine, according to the operation type in the decoded vector instruction, the execution subsystem that is to perform the operation, and send the address data in the decoded vector instruction to the register file; the execution subsystem is configured to execute the data operation instruction in the decoded vector instruction; and the register file is configured to store operation results.
2. The vector fixed-point ALU processing system of claim 1, wherein the decoded vector instruction is 32 bits long and comprises: a vector instruction type, occupying 7 bits; a destination register address, occupying 5 bits; a funct3 opcode, occupying 3 bits; a first source register address, occupying 5 bits; a second source register address, occupying 5 bits; a mask operation enable, occupying 1 bit; and a funct6 opcode, occupying 6 bits.
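The seven fields of claim 2 add up to 32 bits (7 + 5 + 3 + 5 + 5 + 1 + 6). The claim gives widths but not bit positions; the sketch below assumes the fields are packed from the least significant bit upward in the order listed, which is the layout used by RISC-V vector arithmetic instructions. The field names (`opcode`, `vd`, `vs1`, `vs2`, `vm`) are borrowed from that convention and are not taken from the patent.

```python
from typing import NamedTuple


class VectorInstr(NamedTuple):
    opcode: int   # vector instruction type, 7 bits (assumed bits 6..0)
    vd: int       # destination register address, 5 bits (assumed bits 11..7)
    funct3: int   # 3 bits (assumed bits 14..12)
    vs1: int      # first source register address, 5 bits (assumed bits 19..15)
    vs2: int      # second source register address, 5 bits (assumed bits 24..20)
    vm: int       # mask operation enable, 1 bit (assumed bit 25)
    funct6: int   # 6 bits (assumed bits 31..26)


def decode(word: int) -> VectorInstr:
    """Split a 32-bit instruction word into its seven fields."""
    return VectorInstr(
        opcode=word & 0x7F,
        vd=(word >> 7) & 0x1F,
        funct3=(word >> 12) & 0x7,
        vs1=(word >> 15) & 0x1F,
        vs2=(word >> 20) & 0x1F,
        vm=(word >> 25) & 0x1,
        funct6=(word >> 26) & 0x3F,
    )


def encode(f: VectorInstr) -> int:
    """Inverse of decode, used here only to check the field layout."""
    return (f.opcode | f.vd << 7 | f.funct3 << 12 | f.vs1 << 15
            | f.vs2 << 20 | f.vm << 25 | f.funct6 << 26)
```

The destination and source addresses extracted here are the "address data" that the issue subsystem forwards to the register file.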
3. The vector fixed-point ALU processing system of claim 1, wherein the execution subsystem comprises: a first channel unit, a second channel unit, a third channel unit, and a fourth channel unit;
each channel unit includes an ALU operation subunit.
4. The vector fixed-point ALU processing system of claim 3, wherein the execution subsystem retrieves the data to be processed from the register file and divides it into 4 portions, each of the 4 portions being fed into a different channel unit.
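The four-way split of claim 4 can be illustrated with a short model. It assumes 256-bit vector data and 64-bit channel units, as the description suggests, and further assumes that channel 0 receives the least significant 64 bits; the helper names are hypothetical.

```python
LANES = 4
LANE_BITS = 64
LANE_MASK = (1 << LANE_BITS) - 1


def split_lanes(value: int) -> list[int]:
    """Divide a 256-bit value into 4 portions of 64 bits, lowest portion first."""
    return [(value >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def join_lanes(parts: list[int]) -> int:
    """Reassemble the 4 per-channel results into one 256-bit value."""
    return sum((p & LANE_MASK) << (i * LANE_BITS) for i, p in enumerate(parts))


def lanewise(op, a: int, b: int) -> int:
    """Apply a 64-bit operation independently in each channel unit."""
    return join_lanes([op(x, y) for x, y in zip(split_lanes(a), split_lanes(b))])
```

Because each channel works only on its own portion, a carry out of one 64-bit channel never reaches the next one: adding 1 to a channel holding all ones wraps that channel to 0 and leaves the others unchanged.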
5. The vector fixed-point ALU processing system of claim 4, wherein, in each channel unit, the ALU operation subunit performs an operation on the data to be processed in that channel unit;
the operation types include: integer arithmetic, integer logic, integer move, integer reduction, fixed-point averaging add-subtract, fixed-point saturating add-subtract, fixed-point logic, and fixed-point narrowing instructions.
6. The vector fixed-point ALU processing system of claim 3, wherein the ALU operation subunit comprises: an integer addition module, a fixed-point averaging addition module, a fixed-point saturating addition module, a fixed-point shift module, an integer shift module, an integer logic module, and an integer comparison module;
the integer addition module comprises: 8-bit adders, 16-bit adders, 32-bit adders, and 64-bit adders; the fixed-point averaging addition module comprises: 8-bit adders, 16-bit adders, 32-bit adders, and 64-bit adders; the fixed-point saturating addition module comprises: 8-bit adders, 16-bit adders, 32-bit adders, and 64-bit adders; the fixed-point shift module comprises: 8-bit fixed-point shift, 16-bit fixed-point shift, 32-bit fixed-point shift, and 64-bit fixed-point shift; the integer shift module comprises: 8-bit integer shift, 16-bit integer shift, 32-bit integer shift, and 64-bit integer shift; the integer logic module comprises: AND, OR, and XOR; the integer comparison module is configured to compute the difference of the data input to the ALU operation subunit and to determine their relative magnitude from the difference.
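To make the saturating and averaging addition modules of claim 6 concrete, the sketch below shows the usual arithmetic meaning of those terms. It is an illustration under assumptions, not the claimed circuit: operands are plain Python integers already within range for the given width, the function names are hypothetical, and the averaging add simply truncates (the rdn mode) rather than modelling all four rounding modes.

```python
def sat_add_signed(a: int, b: int, width: int) -> int:
    """Signed add that clamps to the representable range instead of wrapping."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, a + b))


def sat_add_unsigned(a: int, b: int, width: int) -> int:
    """Unsigned add that clamps at the maximum value for the width."""
    return min((1 << width) - 1, a + b)


def avg_add(a: int, b: int) -> int:
    """Averaging add: the full-precision sum shifted right by one bit (truncating)."""
    return (a + b) >> 1
```

The averaging add never overflows the element width, because halving the sum of two in-range values always yields an in-range value; that is why the intermediate sum needs one extra bit but the result does not.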
7. The vector fixed-point ALU processing system of claim 6, wherein the 8-bit adder in the integer addition module comprises: two selectors and three 4-bit carry look-ahead adders; the 16-bit adder in the integer addition module comprises: two selectors and three 8-bit carry-select adders; the 32-bit adder in the integer addition module comprises: two selectors and three 16-bit carry-select adders; the 64-bit adder in the integer addition module comprises: two selectors and three 32-bit carry-select adders.
8. The vector fixed-point ALU processing system of claim 7, wherein the 4-bit carry look-ahead adder has the operational expressions:
[Formula image QLYQS_1]
[Formula image QLYQS_2]
wherein [QLYQS_4] is the [QLYQS_6]-th bit output of the 4-bit carry look-ahead adder, 4 bits in total; [QLYQS_9] is the [QLYQS_3]-th bit of the first kind of source data input to the 4-bit carry look-ahead adder, 4 bits in total; [QLYQS_8] is the [QLYQS_10]-th bit of the second kind of source data input to the 4-bit carry look-ahead adder, 4 bits in total; [QLYQS_12] is the output of the [QLYQS_5]-th NOR gate, there being 3 NOR gates in total; [QLYQS_7] is the output of the [QLYQS_11]-th NOR gate; and [QLYQS_13] is the exclusive-OR operation.
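The formula images of claim 8 are not reproduced in this text, so the patent's exact NOR-gate expressions cannot be shown. As a point of reference only, the following is the textbook generate/propagate form of a 4-bit carry look-ahead adder; it is a swapped-in standard technique, not the claimed formulation, and the names `g`, `p`, and `cla4` are assumptions.

```python
def cla4(a: int, b: int, cin: int = 0) -> tuple[int, int]:
    """Textbook 4-bit carry look-ahead add; returns (4-bit sum, carry out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate: both bits set
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate: exactly one bit set
    c0 = cin & 1
    # Every carry is written directly in terms of g, p and c0, so no carry
    # has to wait for the previous stage to settle.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))  # sum bit = p XOR carry in
    return total, c4
```

The final exclusive-OR per bit corresponds to the one operator that the claim's symbol list does name explicitly.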
9. The vector fixed-point ALU processing system of claim 6, wherein the multi-bit adder in the integer addition module comprises: a plurality of selectors and a plurality of multi-bit adder sub-modules;
each multi-bit adder sub-module is configured to add the input source data according to an input carry to obtain an output carry and output data;
the selectors are configured to select among the output carries and the output data of the multi-bit adder sub-modules.
10. The vector fixed-point ALU processing system of claim 9, wherein said number of selectors is 2 and said number of multi-bit adder sub-modules is 3.
11. The vector fixed-point ALU processing system of claim 7, wherein the multi-bit adder in the integer addition module comprises: a first selector, a second selector, a first multi-bit adder sub-module, a second multi-bit adder sub-module, and a third multi-bit adder sub-module, wherein the multi-bit adder has N bits and each multi-bit adder sub-module has N/2 bits, the values of N including 8, 16, 32, and 64, and the multi-bit adder sub-modules including carry look-ahead adders and carry-select adders;
the data input to the N-bit adder in the integer addition module comprise: a first kind of source data and a second kind of source data, each of N bits; after entering the N-bit adder, the first kind of source data and the second kind of source data are each divided into high N/2 bits and low N/2 bits, yielding the low N/2 bits of the first kind of source data, the high N/2 bits of the first kind of source data, the low N/2 bits of the second kind of source data, and the high N/2 bits of the second kind of source data;
the inputs of the first N/2-bit adder sub-module comprise: the low N/2 bits of the first kind of source data, the low N/2 bits of the second kind of source data, and the input carry, and its outputs comprise: a first output carry and an output low N/2-bit sum; the inputs of the second N/2-bit adder sub-module comprise: the high N/2 bits of the first kind of source data, the high N/2 bits of the second kind of source data, and a low carry, and its outputs comprise: a second output carry and a first output high N/2-bit sum; the inputs of the third N/2-bit adder sub-module comprise: the high N/2 bits of the first kind of source data, the high N/2 bits of the second kind of source data, and a high carry, and its outputs comprise: a third output carry and a second output high N/2-bit sum; the inputs of the first selector comprise: the first output carry, the second output carry, and the third output carry, and its output comprises: a selected output carry; the inputs of the second selector comprise: the first output carry, the first output high N/2-bit sum, and the second output high N/2-bit sum, and its output comprises: a selected output high N/2-bit sum; the selected output high N/2-bit sum and the output low N/2-bit sum are concatenated and combined with the selected output carry to obtain the calculation result of the N-bit adder.
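The structure of claim 11 is a carry-select arrangement: the high half is added twice, once for each possible carry from the low half, and the low half's carry then picks the right answer. The sketch below models that data flow, assuming the "low carry" is a carry-in of 0 and the "high carry" a carry-in of 1. Each N/2-bit sub-module is replaced by ordinary integer addition, and the function names are hypothetical.

```python
def add_with_carry(a: int, b: int, cin: int, width: int) -> tuple[int, int]:
    """Stand-in for one adder sub-module: returns (sum of `width` bits, carry out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, cin: int, n: int) -> tuple[int, int]:
    """N-bit add built from three N/2-bit adders and two selectors."""
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, (a >> half) & mask
    b_lo, b_hi = b & mask, (b >> half) & mask

    sum_lo, carry_lo = add_with_carry(a_lo, b_lo, cin, half)    # first sub-module
    sum_hi0, carry_hi0 = add_with_carry(a_hi, b_hi, 0, half)    # second: carry-in 0
    sum_hi1, carry_hi1 = add_with_carry(a_hi, b_hi, 1, half)    # third: carry-in 1

    # Both selectors are steered by the first output carry.
    carry_out = carry_hi1 if carry_lo else carry_hi0            # first selector
    sum_hi = sum_hi1 if carry_lo else sum_hi0                   # second selector

    return (sum_hi << half) | sum_lo, carry_out                 # concatenate halves
```

The three sub-module additions do not depend on one another, so in hardware they can run at the same time; only the two selections wait for the low half's carry.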
12. The vector fixed-point ALU processing system of claim 1, wherein said address data comprises: the address of the source register and the address of the destination register.
CN202310070128.4A 2023-02-07 2023-02-07 Vector fixed point ALU processing system Active CN115826910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310070128.4A CN115826910B (en) 2023-02-07 2023-02-07 Vector fixed point ALU processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310070128.4A CN115826910B (en) 2023-02-07 2023-02-07 Vector fixed point ALU processing system

Publications (2)

Publication Number Publication Date
CN115826910A true CN115826910A (en) 2023-03-21
CN115826910B CN115826910B (en) 2023-05-02

Family

ID=85520863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310070128.4A Active CN115826910B (en) 2023-02-07 2023-02-07 Vector fixed point ALU processing system

Country Status (1)

Country Link
CN (1) CN115826910B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1894659A (en) * 2003-12-09 2007-01-10 Arm有限公司 Data processing apparatus and method for moving data between registers and memory
CN102750133A (en) * 2012-06-20 2012-10-24 中国电子科技集团公司第五十八研究所 32-Bit triple-emission digital signal processor supporting SIMD
CN103885919A (en) * 2014-03-20 2014-06-25 北京航空航天大学 Multi-DSP and multi-FPGA parallel processing system and implement method
US20140208078A1 (en) * 2013-01-23 2014-07-24 International Business Machines Corporation Vector checksum instruction
CN105373367A (en) * 2015-10-29 2016-03-02 中国人民解放军国防科学技术大学 Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector
US20170269931A1 (en) * 2016-03-16 2017-09-21 National Taiwan University Method and Computing System for Handling Instruction Execution Using Affine Register File on Graphic Processing Unit
CN107329936A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing neural network computing and matrix/vector computing
CN108369511A (en) * 2015-12-18 2018-08-03 英特尔公司 Instruction for the storage operation that strides based on channel and logic
CN108369573A (en) * 2015-12-18 2018-08-03 英特尔公司 The instruction of operation for multiple vector elements to be arranged and logic
CN108459840A (en) * 2018-02-14 2018-08-28 中国科学院电子学研究所 A kind of SIMD architecture floating-point fusion point multiplication operation unit
CN109388427A (en) * 2017-08-11 2019-02-26 龙芯中科技术有限公司 Vector processing method, vector processing unit and microprocessor
CN110036369A (en) * 2017-07-20 2019-07-19 上海寒武纪信息科技有限公司 A kind of calculation method and Related product
CN111913746A (en) * 2020-08-31 2020-11-10 中国人民解放军国防科技大学 Design method of low-overhead embedded processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
庞业勇: "面向在线时间序列预测的KAF向量处理器研究" *

Also Published As

Publication number Publication date
CN115826910B (en) 2023-05-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant