CN115826910B - Vector fixed point ALU processing system - Google Patents


Info

Publication number
CN115826910B
Authority
CN
China
Prior art keywords
bit
adder
bits
carry
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310070128.4A
Other languages
Chinese (zh)
Other versions
CN115826910A (en)
Inventor
李霞
周琦
贾筠
李晋
王荣丰
霍旭东
杜鹰
胡波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sunway Technology Co ltd
Original Assignee
Chengdu Sunway Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sunway Technology Co ltd
Priority to CN202310070128.4A
Publication of CN115826910A
Application granted
Publication of CN115826910B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention provides a vector fixed-point ALU processing system comprising a decoder, an instruction transmitting subsystem, a register file and an execution subsystem. A plurality of operation modules with different bit widths are arranged in the ALU operation subunit and operate in parallel, which improves operation speed. This solves the problem that an operation unit generally selects the data bit width (that is, the element bit width) before calculating, so that the calculation structure has many levels and calculation is slow when data of several bit widths are involved. Fixed-point decimals are processed directly by the fixed-point average adding module, the fixed-point saturation adding module and the fixed-point shifting module; a fixed-point pure decimal need not be calculated as a floating-point number, and no exponent or mantissa part is calculated, which solves the problems of the complex structure, high calculation cost and long calculation period of a floating-point calculation unit.

Description

Vector fixed point ALU processing system
Technical Field
The invention relates to the technical field of integrated circuits, and in particular to a vector fixed-point ALU processing system.
Background
The semiconductor and integrated circuit industries are core industries of the information age: strategic, fundamental and pioneering industries that support economic and social development. As information processing accelerates, the performance demanded of integrated circuit chips keeps rising, and the way a chip processes data largely determines its performance. Large amounts of data can be processed in parallel. Vectorization is one form of parallel data processing: each data set is placed in a vector, and operating on two vectors of source data processes many data items at once. Vector processing is an effective way for a processor to handle data in batches and improve performance, and its key part is the design of the operation unit. The data processed by the arithmetic logic unit of most CPUs generally have two types: integer and floating point. Fixed-point numbers comprise pure integers and pure decimals; generally, pure decimals are calculated in a floating-point operation unit and pure integers in an integer operation unit.
Problems of the prior art:
1. The arithmetic unit that processes data in a CPU generally selects the data bit width (that is, the element bit width) before calculating, so the calculation structure has many levels and calculation is slow when several bit widths are involved.
2. Data are divided into fixed-point and floating-point numbers, and fixed-point numbers are further divided into pure integers, pure decimals, and integers with a fractional part. Most pure decimals are calculated in floating-point units, whose structure is complex, whose calculation cost is high, and whose calculation period is long.
3. A floating-point number is divided into a sign bit, an exponent part and a mantissa part. When the algorithm is mapped to a hardware circuit, the number of adders increases, a pure decimal requires two additions, and the calculation time increases.
Disclosure of Invention
In view of the above-mentioned shortcomings in the prior art, the present invention provides a vector fixed-point ALU processing system that solves the problems of the prior art described in the background.
In order to achieve the aim of the invention, the invention adopts the following technical scheme: a vector fixed-point ALU processing system, comprising: a decoder, an instruction transmitting subsystem, a register file and an execution subsystem;
the decoder is used for decoding a vector instruction to obtain a decoded vector instruction; the instruction transmitting subsystem is used for receiving the decoded vector instruction, determining the execution subsystem that is to execute the operation according to the operation type in the decoded vector instruction, and transmitting the address data in the decoded vector instruction to the register file; the execution subsystem is used for executing the data operation instruction in the decoded vector instruction; the register file is used for storing operation results.
Further, the decoded vector instruction is 32 bits long and comprises: the vector instruction type, occupying 7 bits; the address of the target register, occupying 5 bits; the funct3 opcode, occupying 3 bits; the address of the first source register, occupying 5 bits; the address of the second source register, occupying 5 bits; the mask operation enable, occupying 1 bit; and the funct6 opcode, occupying 6 bits.
Further, the execution subsystem includes: a first channel unit, a second channel unit, a third channel unit, and a fourth channel unit;
one ALU operation subunit is included in each channel unit.
Further, the execution subsystem retrieves the data to be processed from the register file, divides the data to be processed into 4 parts, sends each part of data to be processed into one channel unit, and the channel units into which the 4 parts of data to be processed enter are different.
Further, in each channel unit, an ALU operation subunit performs operation on data to be processed in the channel unit;
the operation types include: integer arithmetic, integer logic, integer movement, integer reduction, fixed point average addition and subtraction, fixed point saturation addition and subtraction, fixed point logic, and fixed point narrowing instructions.
Further, the ALU operation subunit includes: the device comprises an integer adding module, a fixed point average adding module, a fixed point saturation adding module, a fixed point shifting module, an integer logic module and an integer comparison module;
The integer adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder, and a 64-bit adder; the fixed-point average adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder, and a 64-bit adder; the fixed-point saturation adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder, and a 64-bit adder; the fixed-point shift module comprises: 8-bit, 16-bit, 32-bit and 64-bit fixed-point shifts; the integer shift module comprises: 8-bit, 16-bit, 32-bit and 64-bit integer shifts; the integer logic module comprises: AND, OR and exclusive-OR operations; the integer comparison module subtracts the data input to the ALU operation subunit and determines their relative magnitude from the difference.
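To make the fixed-point average adding and saturation adding modules concrete, the following is a minimal behavioral sketch, not the patented circuit. It assumes signed two's-complement elements and, for the average, rounding toward negative infinity; the source states neither the signedness nor the rounding mode, and the function names are hypothetical. For a pure decimal held in, say, 8 bits with 7 fractional bits, the same integer operations apply directly, which is why no exponent or mantissa handling is needed.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturating_add(a: int, b: int, bits: int) -> int:
    """Signed add that clamps to the representable range instead of wrapping."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    total = to_signed(a, bits) + to_signed(b, bits)
    return max(lowest, min(highest, total))


def averaging_add(a: int, b: int, bits: int) -> int:
    """Signed (a + b) / 2 with the sum kept at full precision.

    Assumption: the halving rounds toward negative infinity
    (arithmetic shift right); the source does not name a rounding mode.
    """
    return (to_signed(a, bits) + to_signed(b, bits)) >> 1
```

Because the sum is formed at full precision before halving, the average can never overflow the element width, whereas the saturating add clamps results that would otherwise wrap.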
Further, the 8-bit adder in the integer adding module comprises: two selectors and three 4-bit carry-lookahead adders; the 16-bit adder in the integer adding module comprises: two selectors and three 8-bit carry-select adders; the 32-bit adder in the integer adding module comprises: two selectors and three 16-bit carry-select adders; the 64-bit adder in the integer adding module comprises: two selectors and three 32-bit carry-select adders.
Further, the operation expression of the 4-bit carry-lookahead adder is as follows:
[Figure SMS_1]
[Figure SMS_2]
wherein [Figure SMS_5] is bit [Figure SMS_7] of the output of the 4-bit carry-lookahead adder, 4 bits in total; [Figure SMS_10] is bit [Figure SMS_4] of the first type of source data input to the 4-bit carry-lookahead adder, 4 bits in total; [Figure SMS_8] is bit [Figure SMS_11] of the second type of source data input to the 4-bit carry-lookahead adder, 4 bits in total; [Figure SMS_12] is the output of NOR gate [Figure SMS_3], 3 in total; [Figure SMS_6] is the output of NOR gate [Figure SMS_9]; and [Figure SMS_13] is the exclusive-OR operation.
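The expression images are not available in this text, so the patent's exact gate-level form cannot be reproduced here. As a reference point only, the sketch below models the textbook generate/propagate formulation of a 4-bit carry-lookahead adder; it is an assumption that this matches the behavior of the adder in FIG. 5, and the name `cla4` is hypothetical.

```python
from typing import Tuple


def cla4(a: int, b: int, carry_in: int = 0) -> Tuple[int, int]:
    """Add two 4-bit values; return (4-bit sum, carry out).

    Textbook formulation (assumed, not taken from the patent figures):
      generate  g[i] = a[i] AND b[i]
      propagate p[i] = a[i] XOR b[i]
      carry     c[i+1] = g[i] OR (p[i] AND c[i])
      sum       s[i] = p[i] XOR c[i]
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]

    carries = [carry_in & 1]
    for i in range(4):
        # Software evaluates the recurrence step by step; a lookahead
        # circuit flattens it so every carry depends only on g, p and carry_in.
        carries.append(g[i] | (p[i] & carries[i]))

    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[4]
```

The model can be checked exhaustively against ordinary integer addition, since there are only 16 x 16 x 2 input combinations.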
Further, the multi-bit adder in the integer adding module comprises: a first selector, a second selector, a first multi-bit adder sub-module, a second multi-bit adder sub-module and a third multi-bit adder sub-module, wherein the multi-bit adder has N bits, each multi-bit adder sub-module has N/2 bits, and N takes the values 8, 16, 32 and 64; the types of multi-bit adder sub-module comprise: the carry-lookahead adder and the carry-select adder;
the data input to the N-bit adder in the integer adding module comprise: the low N/2 bits of the first type of source data, the high N/2 bits of the first type of source data, the low N/2 bits of the second type of source data, the high N/2 bits of the second type of source data, and the input carry, wherein the first type of source data and the second type of source data are each N bits;
the inputs of the first N/2-bit adder sub-module comprise: the low N/2 bits of the first type of source data, the low N/2 bits of the second type of source data, and the input carry, and its outputs comprise: the first output carry and the output low N/2-bit sum; the inputs of the second N/2-bit adder sub-module comprise: the high N/2 bits of the first type of source data, the high N/2 bits of the second type of source data, and the low carry, and its outputs comprise: the second output carry and the first output high N/2-bit sum; the inputs of the third N/2-bit adder sub-module comprise: the high N/2 bits of the first type of source data, the high N/2 bits of the second type of source data, and the high carry, and its outputs comprise: the third output carry and the second output high N/2-bit sum; the inputs of the first selector comprise: the first, second and third output carries, and its output is the selected output carry; the inputs of the second selector comprise: the first output carry, the first output high N/2-bit sum and the second output high N/2-bit sum, and its output is the selected output high N/2-bit sum; the selected output high N/2-bit sum is concatenated with the output low N/2-bit sum and combined with the selected output carry to obtain the calculation result of the N-bit adder.
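The structure described above is a carry-select arrangement, and a behavioral sketch may help. It assumes that the "low carry" and "high carry" fed to the second and third sub-modules are constant carry-ins of 0 and 1 respectively, which is the usual carry-select convention but is not spelled out in the text. The names `sub_adder` and `carry_select_add` are hypothetical, and the sub-module is modeled with plain integer addition rather than a gate-level adder.

```python
def sub_adder(a, b, carry_in, bits):
    """Model of one N/2-bit adder sub-module: returns (sum, carry out)."""
    total = a + b + carry_in
    return total & ((1 << bits) - 1), total >> bits


def carry_select_add(a, b, carry_in, n):
    """N-bit add built from three N/2-bit sub-modules and two selectors."""
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, (a >> half) & mask
    b_lo, b_hi = b & mask, (b >> half) & mask

    # First sub-module: low halves with the real input carry.
    low_sum, carry1 = sub_adder(a_lo, b_lo, carry_in, half)
    # Second and third sub-modules: high halves computed speculatively
    # for both possible carries out of the low half (assumed 0 and 1).
    high_sum0, carry2 = sub_adder(a_hi, b_hi, 0, half)
    high_sum1, carry3 = sub_adder(a_hi, b_hi, 1, half)

    # The first output carry steers both selectors.
    carry_out = carry3 if carry1 else carry2
    high_sum = high_sum1 if carry1 else high_sum0

    return (high_sum << half) | low_sum, carry_out
```

The design trades area for speed: the high half is computed twice in parallel, so the high-half result is ready as soon as the low half's carry is known, instead of waiting for that carry before starting.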
Further, the address data includes: an address of a source register and an address of a destination register.
The technical scheme of the embodiment of the invention has at least the following advantages and beneficial effects:
1. A plurality of operation modules with different bit widths are arranged in the ALU operation subunit and operate in parallel, which improves operation speed and solves the problem that an operation unit generally selects the data bit width (that is, the element bit width) before calculating, so that the calculation structure has many levels and calculation is slow when data of several bit widths are involved.
2. Fixed-point decimals are processed directly by the fixed-point average adding module, the fixed-point saturation adding module and the fixed-point shifting module; a fixed-point pure decimal need not be calculated as a floating-point number, and no exponent or mantissa part is calculated, which solves the problems of the complex structure, high calculation cost and long calculation period of a floating-point calculation unit.
Drawings
FIG. 1 is a system block diagram of a vector fixed point ALU processing system;
FIG. 2 is a diagram showing the construction of a decoded vector instruction;
FIG. 3 is a diagram of a multi-channel data parallel distribution;
FIG. 4 is a schematic diagram of a 64-bit ALU operation subunit;
FIG. 5 is a schematic diagram of a 4-bit carry-lookahead adder;
FIG. 6 is a schematic diagram of an 8-bit adder;
FIG. 7 is a schematic diagram of a 16-bit adder;
FIG. 8 is a schematic diagram of a 32-bit adder;
FIG. 9 is a schematic diagram of a 64-bit adder.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
As shown in FIG. 1, a vector fixed-point ALU processing system comprises: a decoder, an instruction transmitting subsystem, a register file and an execution subsystem;
the decoder is used for decoding a vector instruction to obtain a decoded vector instruction; the instruction transmitting subsystem is used for receiving the decoded vector instruction, determining the execution subsystem that is to execute the operation according to the operation type in the decoded vector instruction, and transmitting the address data in the decoded vector instruction to the register file; the execution subsystem is used for executing the data operation instruction in the decoded vector instruction; the register file is used for storing operation results.
The address data includes: an address of a source register and an address of a destination register.
The decoded vector instruction is 32 bits long and comprises: the vector instruction type, occupying 7 bits; the address of the target register, occupying 5 bits; the funct3 opcode, occupying 3 bits; the address of the first source register, occupying 5 bits; the address of the second source register, occupying 5 bits; the mask operation enable, occupying 1 bit; and the funct6 opcode, occupying 6 bits.
FIG. 2 shows a 3-frame decoded vector instruction. The decoded vector instruction is 32 bits long: (I) represents the vector instruction type and occupies 7 bits; (II) represents the target register address and occupies 5 bits; (III) represents the funct3 opcode and occupies 3 bits; (IV) represents the address of the first source register and occupies 5 bits; (V) represents the address of the second source register and occupies 5 bits; (VI) represents the mask operation enable and occupies 1 bit; (VII) represents the funct6 opcode and occupies 6 bits.
The specific functions of each part are as follows:
(I) Instruction type (7 bits): the value 1010111 denotes a calculation instruction and is used to select the execution subsystem that executes vector instructions; in the present ALU processing system it is always 1010111.
(II) vd, the target register address (5 bits): the output data of the ALU operation subunit are stored at the target register address, which in the present system is an address in the vector register file.
(III) funct3 opcode (3 bits): in the present system funct3 takes one of 3 values. funct3 = 000 means that the execution subsystem performs a vector-vector calculation; funct3 = 100 means that the execution subsystem performs a vector-scalar calculation; funct3 = 011 means that the execution subsystem performs a vector-immediate calculation.
(IV) vs1/rs1/imm[4:0] (5 bits): when funct3 = 000, vs1 is the address of the first source register, which supplies the operand input to the ALU operation subunit (that is, the first type of source data); when funct3 = 100, rs1 indicates that the operand is scalar data; when funct3 = 011, imm[4:0] indicates that the operand is an immediate.
(V) vs2, the address of the second source register (5 bits): this register supplies the operand input to the ALU operation subunit (that is, the second type of source data); in the present system it is a vector register file address.
(VI) vm, the mask operation bit (1 bit): indicates whether masking is used; vm = 0 indicates that a mask is used, and vm = 1 indicates that no mask is used.
(VII) funct6 opcode (6 bits): indicates which arithmetic or logic operation the ALU operation subunit performs. An arithmetic operation is an operation between two numbers in which the result of the lower bits affects the higher bits (carry). A logic operation is a bitwise operation in which the result of the lower bits does not affect the higher bits (no carry). Depending on the funct6 opcode, data are sent to the execution subsystem, and different execution subsystems are selected for operation.
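The seven fields above can be unpacked with shifts and masks. The sketch below is illustrative only: the text gives the field widths and their order (I) to (VII) but not explicit bit positions, so it assumes the fields are packed from bit 0 upward in that order, which is the layout the RISC-V vector arithmetic format uses. The function and dictionary key names are hypothetical.

```python
OPCODE_VECTOR = 0b1010111  # field (I): always this value in the described system

# Field (III): the three funct3 values the text lists.
OPERAND_KIND = {
    0b000: "vector-vector",
    0b100: "vector-scalar",
    0b011: "vector-immediate",
}


def decode_vector_instruction(word: int) -> dict:
    """Split a 32-bit instruction word into fields (I) to (VII).

    Assumed bit positions (not stated in the source):
    [6:0] type, [11:7] vd, [14:12] funct3, [19:15] vs1/rs1/imm,
    [24:20] vs2, [25] vm, [31:26] funct6.
    """
    opcode = word & 0x7F
    if opcode != OPCODE_VECTOR:
        raise ValueError("not a vector calculation instruction")
    funct3 = (word >> 12) & 0x7
    vm = (word >> 25) & 0x1
    return {
        "opcode": opcode,
        "vd": (word >> 7) & 0x1F,
        "funct3": funct3,
        "operand_kind": OPERAND_KIND.get(funct3, "unsupported"),
        "vs1_rs1_imm": (word >> 15) & 0x1F,
        "vs2": (word >> 20) & 0x1F,
        "vm": vm,
        "masked": vm == 0,  # vm = 0 means a mask is used
        "funct6": (word >> 26) & 0x3F,
    }
```

The widths sum to 7 + 5 + 3 + 5 + 5 + 1 + 6 = 32, so under this packing every bit of the word belongs to exactly one field.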
The execution subsystem includes: a first channel unit, a second channel unit, a third channel unit, and a fourth channel unit;
one ALU operation subunit is included in each channel unit.
In each channel unit, an ALU operation subunit performs operation on data to be processed in the channel unit;
the operation types include: integer arithmetic, integer logic, integer movement, integer reduction, fixed point average addition and subtraction, fixed point saturation addition and subtraction, fixed point logic, and fixed point narrowing instructions.
The execution subsystem retrieves the data to be processed from the register file, divides the data to be processed into 4 parts, sends each part of data to be processed into one channel unit, and the channel units into which the 4 parts of data to be processed enter are different.
The data stored in the memory are shuffled (interleaved) into the register file. The data in each of the two source registers are divided equally into 4 parts, and each channel receives two corresponding data paths, one from the first type of source data and one from the second type of source data; the data are distributed in interleaved order, as shown in FIG. 3. One channel calculates 64 bits (8 bytes) of data at a time, so the 4 channels calculate 256 bits (32 bytes) of data simultaneously and in parallel, which speeds up operations on the source data in the vector registers. The calculated results are stored in the target register, also following the interleaving rule.
As shown in FIG. 3, the value of SEW is determined by the vsew[2:0] field of the control and status register vtype. The width of the data stored in the source register is determined by SEW, and the data in the source register are divided equally into items of the specified SEW bit width; one item of SEW bits is one element. In every case one channel calculates 64 bits at a time: if SEW=64, the 4 channels simultaneously calculate four 64-bit elements; if SEW=32, the 4 channels simultaneously calculate eight 32-bit elements; if SEW=16, the 4 channels simultaneously calculate sixteen 16-bit elements; if SEW=8, the 4 channels simultaneously calculate thirty-two 8-bit elements.
If the vector data exceed 256 bits, several cycles are required to complete the calculation: number of cycles = total number of data bits / number of bits calculated by the 4 channels per cycle (the division is rounded up). For example, if the vector data are 1024 bits, 1024/256 = 4, so 4 cycles are required to complete execution. In the first cycle, the 4 channels simultaneously calculate the 256 bits [255:0]; in the second cycle, the 256 bits [511:256]; in the third cycle, the 256 bits [767:512]; in the fourth cycle, the 256 bits [1023:768].
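The throughput arithmetic in the preceding paragraphs can be written out as a short sketch. The constants come from the text (4 channels, 64 bits per channel per cycle); the function names are hypothetical.

```python
LANES = 4
LANE_BITS = 64
BITS_PER_CYCLE = LANES * LANE_BITS  # 256 bits across the 4 channels


def elements_per_cycle(sew: int) -> int:
    """Number of SEW-bit elements the 4 channels process in one cycle."""
    if sew not in (8, 16, 32, 64):
        raise ValueError("SEW must be 8, 16, 32 or 64")
    return BITS_PER_CYCLE // sew


def cycles_needed(total_bits: int) -> int:
    """Total data bits divided by 256, rounded up."""
    return -(-total_bits // BITS_PER_CYCLE)


def cycle_bit_range(cycle: int) -> tuple:
    """(high bit, low bit) processed in the given zero-based cycle."""
    low = cycle * BITS_PER_CYCLE
    return low + BITS_PER_CYCLE - 1, low
```

For the 1024-bit example, `cycles_needed(1024)` gives 4, and `cycle_bit_range(3)` gives the last slice, bits 1023 down to 768.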
When SEW=64, one element occupies eight bytes and represents a single vector element. The source data in a source register are divided equally into elements of eight bytes each, the elements are stored in the register file in interleaved order, and the first, second, third and fourth channel units take their values from the corresponding positions of the register file to perform the calculation. Each element of the first type of source data is stored in the first source register, and the element of the second type of source data with the same index is stored at the same byte positions of the second source register: the 0th element at byte positions 0X00-0X07; the 1st element at 0X08-0X0F; the 2nd element at 0X10-0X17; the 3rd element at 0X18-0X1F.
When SEW=32, one element occupies four bytes. The source data in a source register are divided equally into elements of four bytes each, the elements are stored in the register file in interleaved order, and the first, second, third and fourth channel units take their values from the corresponding positions of the register file to perform the calculation. Each element of the first type of source data is stored in the first source register, and the element of the second type of source data with the same index is stored at the same byte positions of the second source register: the 0th element at byte positions 0X00-0X03; the 1st element at 0X08-0X0B; the 2nd element at 0X10-0X13; the 3rd element at 0X18-0X1B; the 4th element at 0X04-0X07; the 5th element at 0X0C-0X0F; the 6th element at 0X14-0X17; the 7th element at 0X1C-0X1F.
When SEW=16, one element occupies two bytes. The source data in a source register are divided equally into elements of two bytes each, the elements are stored in the register file in interleaved order, and the first, second, third and fourth channel units take their values from the corresponding positions of the register file to perform the calculation. Each element of the first type of source data is stored in the first source register, and the element of the second type of source data with the same index is stored at the same byte positions of the second source register: the 0th element at byte positions 0X00-0X01; the 1st element at 0X08-0X09; the 2nd element at 0X10-0X11; the 3rd element at 0X18-0X19; the 4th element at 0X04-0X05; the 5th element at 0X0C-0X0D; the 6th element at 0X14-0X15; the 7th element at 0X1C-0X1D; the 8th element at 0X02-0X03; the 9th element at 0X0A-0X0B; the 10th (0X0A) element at 0X12-0X13; the 11th (0X0B) element at 0X1A-0X1B; the 12th (0X0C) element at 0X06-0X07; the 13th (0X0D) element at 0X0E-0X0F; the 14th (0X0E) element at 0X16-0X17; the 15th (0X0F) element at 0X1E-0X1F.
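The byte positions listed for SEW=64, 32 and 16 follow a regular pattern that can be captured in a few lines. The formula below is inferred from those listed positions and is not stated in the source: the element index modulo 4 picks the channel (8 bytes per channel), and the quotient, bit-reversed, picks the slot inside the channel. Treat it as a sketch that reproduces the enumerated cases; the function names are hypothetical.

```python
def bit_reverse(value: int, width: int) -> int:
    """Reverse the low `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def element_byte_range(index: int, sew: int, lanes: int = 4, lane_bytes: int = 8):
    """Return (first byte, last byte) of an element within a 32-byte register slice.

    Inferred rule (an assumption fitted to the positions listed in the text):
      channel = index mod 4, slot = index div 4,
      offset inside the channel = bit_reverse(slot) * element size in bytes.
    """
    elem_bytes = sew // 8
    slots_per_lane = lane_bytes // elem_bytes
    if not 0 <= index < lanes * slots_per_lane:
        raise ValueError("element index outside one 256-bit slice")
    lane = index % lanes
    slot = index // lanes
    width = slots_per_lane.bit_length() - 1  # log2 of the slot count
    offset = bit_reverse(slot, width) * elem_bytes
    first = lane * lane_bytes + offset
    return first, first + elem_bytes - 1
```

For example, with SEW=16 the 12th element has channel 0 and slot 3; reversing the two slot bits leaves 3, giving offset 6 and byte positions 0X06-0X07, in agreement with the list above.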
When sew=8, one element occupies one byte. The source data in a source register is divided equally into 1-byte elements, which are arranged in sequence and stored in the register file; the first channel unit, the second channel unit, the third channel unit and the fourth channel unit read their values from the corresponding positions of the register file and perform the data calculation. Each element of the first type of source data is stored in the first source register, and the element with the same index of the second type of source data is stored at the same byte position of the second source register. The byte positions are as follows:
elements 0 to 3 are stored in byte positions 0X00, 0X08, 0X10 and 0X18;
elements 4 to 7 are stored in byte positions 0X04, 0X0C, 0X14 and 0X1C;
elements 8 to 11 (0X08 to 0X0B) are stored in byte positions 0X02, 0X0A, 0X12 and 0X1A;
elements 12 to 15 (0X0C to 0X0F) are stored in byte positions 0X06, 0X0E, 0X16 and 0X1E;
elements 16 to 19 (0X10 to 0X13) are stored in byte positions 0X01, 0X09, 0X11 and 0X19;
elements 20 to 23 (0X14 to 0X17) are stored in byte positions 0X05, 0X0D, 0X15 and 0X1D;
elements 24 to 27 (0X18 to 0X1B) are stored in byte positions 0X03, 0X0B, 0X13 and 0X1B;
elements 28 to 31 (0X1C to 0X1F) are stored in byte positions 0X07, 0X0F, 0X17 and 0X1F.
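The sew=8 placement listed above can be reproduced with a short sketch. The formula used here is an observation inferred from the listed byte positions, not something the description states: the element index modulo 4 picks one of four 8-byte channels, and the byte inside the channel is the 3-bit reversal of the element index divided by 4. The function names `bit_reverse` and `byte_position_sew8` are hypothetical.

```python
def bit_reverse(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def byte_position_sew8(element: int) -> int:
    """Byte position of an 8-bit element in a 32-byte source register.

    Assumption (inferred from the listed positions): 4 channels of
    8 bytes each; the channel is element % 4 and the byte inside the
    channel is the 3-bit reversal of element // 4.
    """
    channel = element % 4
    group = element // 4
    return channel * 8 + bit_reverse(group, 3)


# Byte positions exactly as enumerated in the description for sew=8.
LISTED_POSITIONS = [
    0x00, 0x08, 0x10, 0x18, 0x04, 0x0C, 0x14, 0x1C,
    0x02, 0x0A, 0x12, 0x1A, 0x06, 0x0E, 0x16, 0x1E,
    0x01, 0x09, 0x11, 0x19, 0x05, 0x0D, 0x15, 0x1D,
    0x03, 0x0B, 0x13, 0x1B, 0x07, 0x0F, 0x17, 0x1F,
]

computed = [byte_position_sew8(i) for i in range(32)]
print(computed == LISTED_POSITIONS)
```

Consecutive elements land in different channels, which is consistent with the four channel units processing them in parallel.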
The ALU operation subunit comprises: an integer adding module, a fixed point average adding module, a fixed point saturation adding module, a fixed point shift module, an integer shift module, an integer logic module and an integer comparison module;
the integer adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder and a 64-bit adder; the fixed point average adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder and a 64-bit adder; the fixed point saturation adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder and a 64-bit adder; the fixed point shift module comprises: an 8-bit fixed point shift, a 16-bit fixed point shift, a 32-bit fixed point shift and a 64-bit fixed point shift; the integer shift module comprises: an 8-bit integer shift, a 16-bit integer shift, a 32-bit integer shift and a 64-bit integer shift; the integer logic module comprises: AND, OR and exclusive OR; the integer comparison module computes the difference of the data input to the ALU operation subunit and determines their relative magnitude from the difference.
FIG. 4 shows a 64-bit ALU operation subunit, in which data 1 is the first type of source data and data 2 is the second type of source data. Data 1 and data 2 are input from the register file and take part in the calculation. Data 1 may be of vector, scalar or immediate type, whereas data 2 can only be of vector type; scalar or immediate data is expanded to the corresponding width according to sew, so that the data of each channel is finally 64 bits wide. The ALU operation subunit controls the input data according to the instruction being executed and selects one of the 7 modules to produce the correct result. The instructions implemented by the integer adding module, the fixed point saturation adding module, the fixed point average adding module, the fixed point shift module and the integer shift module are shown in Table 1. The integer logic module contains 3 kinds of logic operations, and the instructions it implements are shown in Table 1. The integer comparison module first checks whether the two source data have the same sign: if the signs differ, the magnitudes are compared directly; if the signs are the same, the integer adding module is called to compute the difference of the two source data, and the comparison is made on that difference; the instructions it implements are shown in Table 1. The 7 modules calculate simultaneously, and the correct result is finally selected and output according to the instruction.
TABLE 1
| Instruction type | Instruction name | Instruction description | Compatible width | Execution module |
| --- | --- | --- | --- | --- |
| Integer arithmetic operations | VADD, VADC, VREDSUM, VWREDSUMU, VWREDSUM, VSUB, VSBC, VRSUB | Add, add with carry, reduction sum, widening unsigned reduction sum, widening reduction sum, subtract, subtract with borrow, reverse subtract | 8/16/32/64 | Integer adding module |
| Fixed point saturation operation | VSADDU, VSADD, VSSUBU, VSSUB | Unsigned/signed saturating add, unsigned/signed saturating subtract | 8/16/32/64 | Fixed point saturation adding module |
| Fixed point averaging operation | VAADDU, VAADD, VASUBU, VASUB | Unsigned/signed averaging add, unsigned/signed averaging subtract | 8/16/32/64 | Fixed point average adding module |
| Fixed point shift operation | VSSRL, VSSRA, VNCLIPU, VNCLIP | Scaling logical/arithmetic right shift, unsigned/signed narrowing clip | 8/16/32/64 | Fixed point shift module |
| Integer shift operation | VSLL, VSRL, VSRA, VNSRL, VNSRA | Logical left/right shift, arithmetic right shift, narrowing logical/arithmetic right shift | 8/16/32/64 | Integer shift module |
| Integer logic operations | VAND, VREDAND, VOR, VREDOR, VXOR, VREDXOR | AND, reduction AND, OR, reduction OR, exclusive OR, reduction exclusive OR | 8/16/32/64 | Integer logic module |
| Integer comparison operation | VMIN, VMINU, VMAX, VMAXU, VREDMINU, VREDMIN, VREDMAXU, VREDMAX, VMSEQ, VMSNE, VMSLT, VMSLTU, VMSLE, VMSLEU, VMSGT, VMSGTU, VMSGE, VMSGEU | Signed/unsigned minimum, signed/unsigned maximum, unsigned/signed reduction minimum, unsigned/signed reduction maximum, equal, not equal, signed/unsigned less than, signed/unsigned less than or equal, signed/unsigned greater than, signed/unsigned greater than or equal | 8/16/32/64 | Integer comparison module, integer adding module |
The 8-bit adder in the integer adder module comprises two selectors and three 4-bit carry-lookahead adders; the 16-bit adder in the integer adder module comprises two selectors and three 8-bit carry-select adders; the 32-bit adder in the integer adder module comprises two selectors and three 16-bit carry-select adders; the 64-bit adder in the integer adder module comprises two selectors and three 32-bit carry-select adders.
The 4-bit adder consists of four 1-bit full adders, which can be combined with serial carry or parallel carry. Take the delay of a basic AND, OR or NOT gate as T, the delay of a NAND or NOR gate as 2T, and the delay of an exclusive-OR gate as 3T. If the 4 full adders form a 4-bit serial-carry adder, the critical path takes 3T + 4 × (T + T) = 11T; if the 4 full adders form a 4-bit parallel-carry adder, the critical path takes 3T + T + 3T + T = 8T. Therefore, in this embodiment, the 4-bit adder uses the parallel arrangement, implemented as a carry-lookahead adder with the structure shown in FIG. 5: the carry input signal of each full adder is obtained in advance by a logic circuit, which increases the operation speed.
The expression of the 4-bit carry-lookahead adder is as follows:
[Figure SMS_14] (1)
[Figure SMS_15] (2)
[Figure SMS_16] (3)
The carry expressions are obtained from (1), (2) and (3) above:
[Figure SMS_17] (4)
[Figure SMS_18] (5)
[Figure SMS_19] (6)
[Figure SMS_20] (7)
[Figure SMS_21] (8)
[Figure SMS_22] (9)
where [Figure SMS_23] is the output of bit [Figure SMS_28] of the 4-bit carry-lookahead adder, 4 bits in total; [Figure SMS_30] is bit [Figure SMS_25] of the first type of source data input to the 4-bit carry-lookahead adder, 4 bits in total; [Figure SMS_26] is bit [Figure SMS_31] of the second type of source data input to the 4-bit carry-lookahead adder, 4 bits in total; [Figure SMS_32] is the output of NOR gate [Figure SMS_24], 3 in total; [Figure SMS_27] is the output of NOR gate [Figure SMS_29]; and [Figure SMS_33] denotes the exclusive-OR operation.
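Since expressions (1) to (9) survive here only as image placeholders, the following is a minimal sketch of a 4-bit carry-lookahead adder in the usual textbook generate/propagate form. This is an assumed, swapped-in formulation: it is not a reconstruction of the expressions above and may differ from the gate-level structure of FIG. 5. The names `cla4`, `g` and `p` are hypothetical.

```python
from typing import Tuple


def cla4(a: int, b: int, cin: int = 0) -> Tuple[int, int]:
    """Add two 4-bit values; return (4-bit sum, carry out).

    Textbook carry-lookahead form: every carry is written directly in
    terms of the inputs and cin, so no carry waits on the previous one.
    """
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate:  Gi = Ai AND Bi
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate: Pi = Ai XOR Bi

    c0 = cin
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i  # Si = Pi XOR Ci
    return total, c4


print(cla4(0b1011, 0b0110, 1))
```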
A multi-bit adder in the integer adder module comprises: a first selector, a second selector, a first multi-bit adder sub-module, a second multi-bit adder sub-module and a third multi-bit adder sub-module. The multi-bit adder has N bits and each multi-bit adder sub-module has N/2 bits, where N is 8, 16, 32 or 64; a multi-bit adder sub-module is either a carry-lookahead adder or a carry-select adder. The data input to the N-bit adder in the integer adder module comprise the first type of source data, the second type of source data and an input carry; the first and second types of source data are N bits wide and are split into the low N/2 bits of the first type of source data, the high N/2 bits of the first type of source data, the low N/2 bits of the second type of source data and the high N/2 bits of the second type of source data. The inputs of the first N/2-bit adder sub-module are the low N/2 bits of the first type of source data, the low N/2 bits of the second type of source data and the input carry; its outputs are a first output carry and an output low N/2-bit sum. The inputs of the second N/2-bit adder sub-module are the high N/2 bits of the first type of source data, the high N/2 bits of the second type of source data and a low carry (carry-in of 0); its outputs are a second output carry and a first output high N/2-bit sum. The inputs of the third N/2-bit adder sub-module are the high N/2 bits of the first type of source data, the high N/2 bits of the second type of source data and a high carry (carry-in of 1); its outputs are a third output carry and a second output high N/2-bit sum. The inputs of the first selector are the first, second and third output carries; its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high N/2-bit sum and the second output high N/2-bit sum; its output is the selected high N/2-bit sum. The selected high N/2-bit sum is concatenated with the output low N/2-bit sum and combined with the selected output carry to give the calculation result of the N-bit adder.
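The N-bit structure described above can be sketched behaviourally as a recursive carry-select adder: the low half is added once, the high half is added twice (carry-in 0 and carry-in 1), and the carry out of the low half selects which high result is used. This is a software model under stated assumptions, not the circuit itself: the names `add4` and `carry_select_add` are hypothetical, and the 4-bit base case uses plain integer arithmetic in place of the carry-lookahead adder.

```python
from typing import Tuple


def add4(a: int, b: int, cin: int) -> Tuple[int, int]:
    """4-bit base adder (stand-in for the 4-bit carry-lookahead adder)."""
    total = a + b + cin
    return total & 0xF, total >> 4


def carry_select_add(a: int, b: int, cin: int, n: int) -> Tuple[int, int]:
    """N-bit carry-select addition for n in {4, 8, 16, 32, 64}.

    Returns (n-bit sum, carry out).
    """
    if n == 4:
        return add4(a, b, cin)
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, (a >> half) & mask
    b_lo, b_hi = b & mask, (b >> half) & mask

    # First sub-module: low halves with the real input carry.
    lo_sum, lo_carry = carry_select_add(a_lo, b_lo, cin, half)
    # Second and third sub-modules: high halves for both possible carries.
    hi_sum0, hi_carry0 = carry_select_add(a_hi, b_hi, 0, half)
    hi_sum1, hi_carry1 = carry_select_add(a_hi, b_hi, 1, half)

    # The two selectors: the low carry picks the high sum and carry out.
    if lo_carry:
        hi_sum, cout = hi_sum1, hi_carry1
    else:
        hi_sum, cout = hi_sum0, hi_carry0
    return (hi_sum << half) | lo_sum, cout


print(carry_select_add(0xFF, 0x01, 0, 8))
```

In hardware the three sub-modules operate at the same time, so the selection costs one multiplexer delay instead of a second full addition; the sequential calls here only model the data flow.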
The N-bit adders in the integer adder module are specifically of the following types:
As shown in FIG. 6, the 8-bit adder in the integer adder module comprises: a first selector, a second selector, a first 4-bit carry-lookahead adder, a second 4-bit carry-lookahead adder and a third 4-bit carry-lookahead adder. The data input to the 8-bit adder in the integer adder module comprise the first type of source data, the second type of source data and an input carry; the first and second types of source data are 8 bits wide and, after entering the 8-bit adder, are split into high 4 bits and low 4 bits, giving the low 4 bits of the first type of source data, the high 4 bits of the first type of source data, the low 4 bits of the second type of source data and the high 4 bits of the second type of source data. The inputs of the first 4-bit carry-lookahead adder are the low 4 bits of the first type of source data, the low 4 bits of the second type of source data and the input carry; its outputs are a first output carry and an output low 4-bit sum. The inputs of the second 4-bit carry-lookahead adder are the high 4 bits of the first type of source data, the high 4 bits of the second type of source data and a low carry (carry-in of 0); its outputs are a second output carry and a first output high 4-bit sum. The inputs of the third 4-bit carry-lookahead adder are the high 4 bits of the first type of source data, the high 4 bits of the second type of source data and a high carry (carry-in of 1); its outputs are a third output carry and a second output high 4-bit sum. The inputs of the first selector are the first, second and third output carries; its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high 4-bit sum and the second output high 4-bit sum; its output is the selected high 4-bit sum. The selected high 4-bit sum is concatenated with the output low 4-bit sum and combined with the selected output carry to give the calculation result of the 8-bit adder.
Specifically, in FIG. 6, the 8-bit data is divided into 2 groups, the lower 4 bits [3:0] and the upper 4 bits [7:4]. The carry into the lower 4 bits is the known input carry Cin; when no carry is input, it defaults to 0. The carry into the upper 4 bits is not known in advance and may be either 0 or 1; its default value is 0. One low 4-bit addition and two high 4-bit additions are computed simultaneously, giving one low 4-bit sum S[3:0] with carry C3 and two candidate high 4-bit sums S[7:4], each with its carry C7. The high 4-bit result is then selected according to the value of the carry C3 of the low 4-bit addition: C3=0 selects C7 and S[7:4] of the 1st high 4-bit adder, and C3=1 selects C7 and S[7:4] of the 2nd high 4-bit adder. S[3:0] and S[7:4] are concatenated to form the sum; finally, the carry Cout=C7 and the sum S[7:0] are output.
As shown in FIG. 7, the 16-bit adder in the integer adder module comprises: a first selector, a second selector, a first 8-bit carry-select adder, a second 8-bit carry-select adder and a third 8-bit carry-select adder. The data input to the 16-bit adder in the integer adder module comprise the first type of source data, the second type of source data and an input carry; the first and second types of source data are 16 bits wide and, after entering the 16-bit adder, are split into high 8 bits and low 8 bits, giving the low 8 bits of the first type of source data, the high 8 bits of the first type of source data, the low 8 bits of the second type of source data and the high 8 bits of the second type of source data. The inputs of the first 8-bit carry-select adder are the low 8 bits of the first type of source data, the low 8 bits of the second type of source data and the input carry; its outputs are a first output carry and an output low 8-bit sum. The inputs of the second 8-bit carry-select adder are the high 8 bits of the first type of source data, the high 8 bits of the second type of source data and a low carry (carry-in of 0); its outputs are a second output carry and a first output high 8-bit sum. The inputs of the third 8-bit carry-select adder are the high 8 bits of the first type of source data, the high 8 bits of the second type of source data and a high carry (carry-in of 1); its outputs are a third output carry and a second output high 8-bit sum. The inputs of the first selector are the first, second and third output carries; its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high 8-bit sum and the second output high 8-bit sum; its output is the selected high 8-bit sum. The selected high 8-bit sum is concatenated with the output low 8-bit sum and combined with the selected output carry to give the calculation result of the 16-bit adder.
Specifically, in FIG. 7, the 16-bit data is divided into 2 groups, the lower 8 bits [7:0] and the upper 8 bits [15:8]. The carry into the lower 8 bits is the known input carry Cin; when no carry is input, it defaults to 0. The carry into the upper 8 bits is not known in advance and may be either 0 or 1; its default value is 0. One low 8-bit addition and two high 8-bit additions are computed simultaneously, giving one low 8-bit sum S[7:0] with carry C7 and two candidate high 8-bit sums S[15:8], each with its carry C15. The high 8-bit result is then selected according to the value of the carry C7 of the low 8-bit addition: C7=0 selects C15 and S[15:8] of the 1st high 8-bit adder, and C7=1 selects C15 and S[15:8] of the 2nd high 8-bit adder. S[7:0] and S[15:8] are concatenated to form the sum; finally, the carry Cout=C15 and the sum S[15:0] are output.
As shown in FIG. 8, the data input to the 32-bit adder in the integer adder module comprise the first type of source data, the second type of source data and an input carry; the first and second types of source data are 32 bits wide and, after entering the 32-bit adder, are split into high 16 bits and low 16 bits, giving the low 16 bits of the first type of source data, the high 16 bits of the first type of source data, the low 16 bits of the second type of source data and the high 16 bits of the second type of source data. The inputs of the first 16-bit carry-select adder are the low 16 bits of the first type of source data, the low 16 bits of the second type of source data and the input carry; its outputs are a first output carry and an output low 16-bit sum. The inputs of the second 16-bit carry-select adder are the high 16 bits of the first type of source data, the high 16 bits of the second type of source data and a low carry (carry-in of 0); its outputs are a second output carry and a first output high 16-bit sum. The inputs of the third 16-bit carry-select adder are the high 16 bits of the first type of source data, the high 16 bits of the second type of source data and a high carry (carry-in of 1); its outputs are a third output carry and a second output high 16-bit sum. The inputs of the first selector are the first, second and third output carries; its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high 16-bit sum and the second output high 16-bit sum; its output is the selected high 16-bit sum. The selected high 16-bit sum is concatenated with the output low 16-bit sum and combined with the selected output carry to give the calculation result of the 32-bit adder.
Specifically, in FIG. 8, the 32-bit data is divided into 2 groups, the lower 16 bits [15:0] and the upper 16 bits [31:16]. The carry into the lower 16 bits is the known input carry Cin; when no carry is input, it defaults to 0. The carry into the upper 16 bits is not known in advance and may be either 0 or 1; its default value is 0. One low 16-bit addition and two high 16-bit additions are computed simultaneously, giving one low 16-bit sum S[15:0] with carry C15 and two candidate high 16-bit sums S[31:16], each with its carry C31. The high 16-bit result is then selected according to the value of the carry C15 of the low 16-bit addition: C15=0 selects C31 and S[31:16] of the 1st high 16-bit adder, and C15=1 selects C31 and S[31:16] of the 2nd high 16-bit adder. S[15:0] and S[31:16] are concatenated to form the sum; finally, the carry Cout=C31 and the sum S[31:0] are output.
As shown in FIG. 9, the data input to the 64-bit adder in the integer adder module comprise the first type of source data, the second type of source data and an input carry; the first and second types of source data are 64 bits wide and, after entering the 64-bit adder, are split into high 32 bits and low 32 bits, giving the low 32 bits of the first type of source data, the high 32 bits of the first type of source data, the low 32 bits of the second type of source data and the high 32 bits of the second type of source data. The inputs of the first 32-bit carry-select adder are the low 32 bits of the first type of source data, the low 32 bits of the second type of source data and the input carry; its outputs are a first output carry and an output low 32-bit sum. The inputs of the second 32-bit carry-select adder are the high 32 bits of the first type of source data, the high 32 bits of the second type of source data and a low carry (carry-in of 0); its outputs are a second output carry and a first output high 32-bit sum. The inputs of the third 32-bit carry-select adder are the high 32 bits of the first type of source data, the high 32 bits of the second type of source data and a high carry (carry-in of 1); its outputs are a third output carry and a second output high 32-bit sum. The inputs of the first selector are the first, second and third output carries; its output is the selected output carry. The inputs of the second selector are the first output carry, the first output high 32-bit sum and the second output high 32-bit sum; its output is the selected high 32-bit sum. The selected high 32-bit sum is concatenated with the output low 32-bit sum and combined with the selected output carry to give the calculation result of the 64-bit adder.
Specifically, in FIG. 9, the 64-bit data is divided into 2 groups, the lower 32 bits [31:0] and the upper 32 bits [63:32]. The carry into the lower 32 bits is the known input carry Cin; when no carry is input, it defaults to 0. The carry into the upper 32 bits is not known in advance and may be either 0 or 1; its default value is 0. One low 32-bit addition and two high 32-bit additions are computed simultaneously, giving one low 32-bit sum S[31:0] with carry C31 and two candidate high 32-bit sums S[63:32], each with its carry C63. The high 32-bit result is then selected according to the value of the carry C31 of the low 32-bit addition: C31=0 selects C63 and S[63:32] of the 1st high 32-bit adder, and C31=1 selects C63 and S[63:32] of the 2nd high 32-bit adder. S[31:0] and S[63:32] are concatenated to form the sum; finally, the carry Cout=C63 and the sum S[63:0] are output.
Specifically, in FIG. 4, in the fixed point saturation adding module (e) of the 64-bit ALU operation subunit, the 8-bit adder (e8) comprises one integer 8-bit carry-select adder (d8) and one saturation operation module; the 16-bit adder (e16) comprises one integer 16-bit carry-select adder (d16) and one saturation operation module; the 32-bit adder (e32) comprises one integer 32-bit carry-select adder (d32) and one saturation operation module; the 64-bit adder (e64) comprises one integer 64-bit carry-select adder (d64) and one saturation operation module. The saturation operation module determines whether the adder result overflows: when it overflows, the saturation value is output; when it does not overflow, the adder sum is output. The output is thus the result of the addition followed by the saturation calculation.
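The add-then-saturate behaviour can be sketched as follows. The description only says that a saturation value is output on overflow; this sketch assumes that the saturation value is the largest or smallest value representable in sew bits, signed or unsigned. The name `saturating_add` is hypothetical, and operands are given as ordinary Python integers already within the sew-bit range.

```python
def saturating_add(a: int, b: int, sew: int, signed: bool) -> int:
    """Add two sew-bit values and clamp the result on overflow.

    Assumption: the saturation value is the nearest representable bound.
    """
    if signed:
        lowest = -(1 << (sew - 1))
        highest = (1 << (sew - 1)) - 1
    else:
        lowest = 0
        highest = (1 << sew) - 1

    total = a + b            # exact sum, as produced by the integer adder
    if total > highest:      # positive overflow: output the upper bound
        return highest
    if total < lowest:       # negative overflow: output the lower bound
        return lowest
    return total             # no overflow: output the adder sum


print(saturating_add(127, 1, 8, signed=True))
print(saturating_add(250, 10, 8, signed=False))
```

A saturating subtract follows the same pattern with `a - b` as the exact result.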
In the fixed point average adding module (f) of the 64-bit ALU operation subunit, the 8-bit adder (f8) comprises one integer 8-bit carry-select adder (d8), one shifter and one rounding module; the 16-bit adder (f16) comprises one integer 16-bit carry-select adder (d16), one shifter and one rounding module; the 32-bit adder (f32) comprises one integer 32-bit carry-select adder (d32), one shifter and one rounding module; the 64-bit adder (f64) comprises one integer 64-bit carry-select adder (d64), one shifter and one rounding module. The shifter computes the average of the adder output value, i.e. it performs a fixed shift of 1 bit. The rounding module determines, according to the fixed point rounding mode, whether the averaged result is rounded down or rounded up. The output is the result of the addition, the 1-bit shift and the rounding calculation.
In the fixed point shift module (g) of the 64-bit ALU operation subunit, the 8-bit fixed point shift (g8) comprises one 8-bit shifter and one rounding module; the 16-bit fixed point shift (g16) comprises one 16-bit shifter and one rounding module; the 32-bit fixed point shift (g32) comprises one 32-bit shifter and one rounding module; the 64-bit fixed point shift (g64) comprises one 64-bit shifter and one rounding module. The shifter performs the shift. The rounding module determines, according to the fixed point rounding mode, whether the shifted data is rounded down or rounded up. The output is the shifted and rounded result.
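The shift-then-round step shared by the average adding module and the fixed point shift module can be sketched as below. The rounding-mode names rnu, rne, rdn and rod are the ones this document lists; the exact increment rules are an assumption taken from the RISC-V vector extension's fixed-point rounding definition, which the instruction names in Table 1 suggest but which the description does not spell out. The function names are hypothetical.

```python
def round_increment(v: int, d: int, mode: str) -> int:
    """Rounding increment (0 or 1) added after shifting v right by d bits."""
    if d == 0:
        return 0
    bit_d = (v >> d) & 1                  # lowest bit that is kept
    bit_half = (v >> (d - 1)) & 1         # highest bit that is shifted out
    below_half = v & ((1 << (d - 1)) - 1)  # remaining shifted-out bits
    shifted_out = v & ((1 << d) - 1)       # all shifted-out bits
    if mode == "rnu":   # round to nearest, ties up
        return bit_half
    if mode == "rne":   # round to nearest, ties to even
        return bit_half & (1 if (below_half != 0 or bit_d) else 0)
    if mode == "rdn":   # truncate
        return 0
    if mode == "rod":   # round to odd: set the kept LSB if anything was lost
        return (1 - bit_d) & (1 if shifted_out != 0 else 0)
    raise ValueError("unknown rounding mode: " + mode)


def roundoff(v: int, d: int, mode: str) -> int:
    """Shift v right by d bits (arithmetic shift) and apply the rounding."""
    return (v >> d) + round_increment(v, d, mode)


def averaging_add(a: int, b: int, mode: str) -> int:
    """Average of a and b: exact sum, fixed 1-bit shift, then rounding."""
    return roundoff(a + b, 1, mode)


def scaling_shift_right(v: int, shamt: int, mode: str) -> int:
    """Scaling right shift: shift by shamt bits, then rounding."""
    return roundoff(v, shamt, mode)


for m in ("rnu", "rne", "rdn", "rod"):
    print(m, averaging_add(2, 3, m), scaling_shift_right(0b1011, 2, m))
```

Python integers do not overflow, so `a + b` is exact here; in hardware the adder's carry out supplies the extra bit that keeps the sum exact before the 1-bit shift.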
In the integer shift module (h) of the 64-bit ALU operation subunit, the 8-bit integer shift (h8) comprises one shifter for shifting; the 16-bit integer shift (h16) comprises one shifter for shifting; the 32-bit integer shift (h32) comprises one shifter for shifting; the 64-bit integer shift (h64) comprises one shifter for shifting. The integer shift module (h) shifts the data directly, and the bits shifted out are simply discarded. The output is the shifted result.
The integer logic module (i) of the 64-bit ALU operation subunit performs only bitwise operations and comprises AND, OR and exclusive-OR function sub-modules. The output is the result of the logic calculation.
The integer comparison module (j) of the 64-bit ALU operation subunit calls the integer adding module (d), directly computes the difference of the two numbers, and compares the difference with 0 to determine the maximum, the minimum, or which value is greater or smaller. If an integer comparison instruction is executed, the output is a Boolean value: 1 if the comparison is true, otherwise 0. If an integer maximum/minimum instruction is executed, the output is the value of the selected source operand.
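The comparison strategy (check the signs first, then fall back to the sign of a difference computed by the adder) can be sketched as follows for signed sew-bit values held as two's-complement bit patterns. This is a minimal behavioural sketch; the names `signed_less_than` and `signed_min` are hypothetical, and the subtraction is written as `a + ~b + 1` to mirror reuse of the adder.

```python
def signed_less_than(a: int, b: int, sew: int) -> bool:
    """Return True if a < b, treating a and b as signed sew-bit patterns."""
    mask = (1 << sew) - 1
    sign_bit = 1 << (sew - 1)
    a &= mask
    b &= mask
    a_negative = bool(a & sign_bit)
    b_negative = bool(b & sign_bit)

    # Different signs: the negative operand is the smaller one.
    if a_negative != b_negative:
        return a_negative

    # Same sign: a - b cannot overflow, so its sign bit decides.
    # The difference is formed with the adder as a + (~b) + 1.
    difference = (a + ((~b) & mask) + 1) & mask
    return bool(difference & sign_bit)


def signed_min(a: int, b: int, sew: int) -> int:
    """Minimum instruction: output the value of the smaller source operand."""
    return a if signed_less_than(a, b, sew) else b


print(int(signed_less_than(0xFF, 0x01, 8)))  # -1 < 1 as 8-bit signed values
print(hex(signed_min(0x80, 0x7F, 8)))
```

Handling the mixed-sign case separately is what makes the same-sign subtraction safe: the difference of two values with the same sign always fits in sew bits.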
At the output of the 64-bit ALU operation subunit, the result of one of the 7 modules in the 64-bit ALU circuit (c) is selected according to the operation type of the instruction and output; this is the result of the instruction calculation.
The ALU processing system handles two data types, namely a pure integer (the fixed point is at the end of the number, the digits after the decimal point are all 0, and the number is generally written directly as an integer) and a pure decimal (the integer digit is 0, the decimal point follows that digit, and the first digit after the decimal point is not 0), so that both integer instructions and fixed-point instructions are implemented, and element bit widths of 8, 16, 32 and 64 are supported.
Inside the CPU, a vector instruction is issued and sent to the decoder for decoding. Each instruction has a specific encoding; the data type (element bit width of the vector source data, the operation part, etc.) is contained in the instruction encoding, which indicates whether the instruction performs an integer or a fixed-point operation, and different instructions correspond to different instruction encodings.
The ALU operation subunit divides the vector instructions into 7 general classes of arithmetic unit computation: integer addition, fixed point saturation addition, fixed point average addition, fixed point shift, integer shift, integer logic and integer comparison. An operation module is selected from the 7 modules according to the instruction; the sub-units of different bit widths inside an operation module calculate the incoming data in parallel, and after the results are registered, the appropriate result is selected according to the instruction.
Before entering the register file, data can undergo a shuffle operation, which supports vector instruction execution, increases parallelism, and makes it easier to execute instructions that operate across element widths.
The technical scheme of the embodiment of the invention has at least the following advantages and beneficial effects:
1. Fixed-point pure fractions need not be computed as floating-point numbers: there is no exponent part or mantissa part to process, so the data enter the pure-fraction operation unit directly and the result is registered.
2. Multi-channel operation is fast: the data are distributed across 4 channels and processed in parallel, which raises the operation speed and allows 256-bit data to be computed.
3. A parallel computing structure is used: the operation units compute several results quickly, and the required result is selected according to the instruction.
4. Compatibility is good: the operation unit supports element widths of 8, 16, 32 and 64 bits.
5. Extensibility is forward-looking: a 3-bit field in a privileged register controls the element bit width. Only part of its encoding range is used at present, so widths of 128, 256, 512 and 1024 bits can be supported in future.
6. The range of operand combinations is wide: vector-vector, vector-scalar and vector-immediate operations.
7. Rounding modes are diverse. Four fixed-point rounding modes are supported: round to nearest with ties rounded up (rnu), round to nearest with ties rounded to even (rne), direct truncation (rdn), and round to odd (rod).
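The four rounding modes in item 7 can be illustrated with a rounding right shift, which is the step an averaging add needs when it halves a sum. The patent does not give the formulas; the sketch below assumes the modes behave like the identically named fixed-point rounding modes of the RISC-V vector extension, and every function name is hypothetical.

```python
def rounding_increment(value: int, shift: int, mode: str) -> int:
    """Return the 0/1 increment added after shifting `value` right by `shift` bits."""
    if shift == 0:
        return 0
    dropped = value & ((1 << shift) - 1)          # bits shifted out
    msb_dropped = (dropped >> (shift - 1)) & 1    # the "one half" bit
    below_half = dropped & ((1 << (shift - 1)) - 1)
    lsb_kept = (value >> shift) & 1               # lowest bit that survives
    if mode == "rnu":   # round to nearest, ties up
        return msb_dropped
    if mode == "rne":   # round to nearest, ties to even
        return msb_dropped & int(below_half != 0 or lsb_kept == 1)
    if mode == "rdn":   # truncate
        return 0
    if mode == "rod":   # round to odd: set the kept LSB if anything was lost
        return int(lsb_kept == 0 and dropped != 0)
    raise ValueError(f"unknown rounding mode: {mode}")


def round_shift_right(value: int, shift: int, mode: str) -> int:
    """Shift right with the selected rounding mode applied."""
    return (value >> shift) + rounding_increment(value, shift, mode)


def averaging_add(a: int, b: int, mode: str) -> int:
    """Average of two integers: full-precision sum, then a rounding shift by 1."""
    return round_shift_right(a + b, 1, mode)
```

Averaging 2 and 3 shows the modes diverging: the exact result 2.5 becomes 3 under rnu, 2 under rne, 2 under rdn and 3 under rod.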
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (7)

1. A vector fixed-point ALU processing system, comprising: a decoder, an instruction issue subsystem, a register file and an execution subsystem;
the decoder is used for decoding a vector instruction to obtain a decoded vector instruction; the instruction issue subsystem is used for receiving the decoded vector instruction, determining from the operation type in the decoded vector instruction which execution subsystem is to perform the operation, and sending the address data in the decoded vector instruction to the register file; the execution subsystem is used for executing the data operation instruction in the decoded vector instruction; the register file is used for storing operation results;
the decoded vector instruction is 32 bits long and comprises: a vector instruction type, occupying 7 bits; the address of the destination register, occupying 5 bits; the function-3 operation code, occupying 3 bits; the address of the first source register, occupying 5 bits; the address of the second source register, occupying 5 bits; a mask-enable bit, occupying 1 bit; and the function-6 operation code, occupying 6 bits;
the execution subsystem includes: a first channel unit, a second channel unit, a third channel unit, and a fourth channel unit;
an ALU operation subunit is included in each channel unit;
the execution subsystem retrieves the data to be processed from the register file, divides it into 4 parts and sends each part to one channel unit, each of the 4 parts entering a different channel unit;
in each channel unit, the ALU operation subunit operates on the data to be processed in that channel unit;
the operation types include: integer arithmetic, integer logic, integer move, integer reduction, fixed-point averaging addition and subtraction, fixed-point saturating addition and subtraction, fixed-point logic, and fixed-point narrowing instructions;
the ALU operation subunit comprises: an integer adding module, a fixed-point average adding module, a fixed-point saturation adding module, a fixed-point shift module, an integer logic module and an integer comparison module;
the integer adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder and a 64-bit adder; the fixed-point average adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder and a 64-bit adder; the fixed-point saturation adding module comprises: an 8-bit adder, a 16-bit adder, a 32-bit adder and a 64-bit adder; the fixed-point shift module comprises: an 8-bit fixed-point shift, a 16-bit fixed-point shift, a 32-bit fixed-point shift and a 64-bit fixed-point shift; the integer shift module comprises: an 8-bit integer shift, a 16-bit integer shift, a 32-bit integer shift and a 64-bit integer shift; the integer logic module comprises: AND, OR and exclusive-OR; the integer comparison module is used for computing the difference of the data input to the ALU operation subunit and determining their relative magnitude from the difference.
2. The vector fixed-point ALU processing system of claim 1, wherein the 8-bit adder in the integer adding module comprises: two selectors and three 4-bit carry-lookahead adders; the 16-bit adder in the integer adding module comprises: two selectors and three 8-bit carry-select adders; the 32-bit adder in the integer adding module comprises: two selectors and three 16-bit carry-select adders; the 64-bit adder in the integer adding module comprises: two selectors and three 32-bit carry-select adders.
3. The vector fixed-point ALU processing system of claim 2, wherein the operation expressions of the 4-bit carry-lookahead adder are:
[Equation images QLYQS_1 and QLYQS_2, not reproduced in this text]
wherein [QLYQS_4] is output bit [QLYQS_6] of the 4-bit carry-lookahead adder, 4 bits in total; [QLYQS_9] is bit [QLYQS_5] of the first source data input to the 4-bit carry-lookahead adder, 4 bits in total; [QLYQS_7] is bit [QLYQS_10] of the second source data input to the 4-bit carry-lookahead adder, 4 bits in total; [QLYQS_12] is the output of NOR gate [QLYQS_3], 3 in total; [QLYQS_8] is the output of NOR gate [QLYQS_11]; and [QLYQS_13] is the exclusive-OR operation.
4. The vector fixed-point ALU processing system of claim 1, wherein the multi-bit adder in the integer adding module comprises: a plurality of selectors and a plurality of multi-bit adder sub-modules;
each multi-bit adder sub-module is used for adding the input source data according to a carry to obtain an output carry and output data;
the selectors are used for selecting the output carry and output data of the multi-bit adder sub-modules.
5. The vector fixed-point ALU processing system of claim 4, wherein the number of selectors is 2 and the number of multi-bit adder sub-modules is 3.
6. The vector fixed-point ALU processing system of claim 2, wherein the multi-bit adder in the integer adding module comprises: a first selector, a second selector, a first multi-bit adder sub-module, a second multi-bit adder sub-module and a third multi-bit adder sub-module, wherein the multi-bit adder has N bits, each multi-bit adder sub-module has N/2 bits, N takes the values 8, 16, 32 and 64, and the multi-bit adder sub-modules comprise carry-lookahead adders and carry-select adders;
the data input to the N-bit adder in the integer adding module comprise: the low N/2 bits of the first source data, the high N/2 bits of the first source data, the low N/2 bits of the second source data, the high N/2 bits of the second source data, and an input carry, wherein the first source data and the second source data are each N bits;
the inputs of the first N/2-bit adder sub-module comprise: the low N/2 bits of the first source data, the low N/2 bits of the second source data, and the input carry, and its outputs comprise: a first output carry and an output low N/2-bit sum; the inputs of the second N/2-bit adder sub-module comprise: the high N/2 bits of the first source data, the high N/2 bits of the second source data, and a low carry, and its outputs comprise: a second output carry and a first output high N/2-bit sum; the inputs of the third N/2-bit adder sub-module comprise: the high N/2 bits of the first source data, the high N/2 bits of the second source data, and a high carry, and its outputs comprise: a third output carry and a second output high N/2-bit sum; the inputs of the first selector comprise: the first, second and third output carries, and its output is the selected output carry; the inputs of the second selector comprise: the first output carry, the first output high N/2-bit sum and the second output high N/2-bit sum, and its output is the selected high N/2-bit sum; the selected high N/2-bit sum is concatenated with the output low N/2-bit sum and combined with the selected output carry to obtain the calculation result of the N-bit adder.
7. The vector fixed-point ALU processing system of claim 1, wherein the address data comprise: an address of a source register and an address of a destination register.
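The adder hierarchy recited in claims 2 to 6 (three half-width sub-adders and two selectors at each level, with 4-bit carry-lookahead adders at the bottom) can be modelled behaviourally in Python. This is a sketch under assumptions, not the patented circuit: the carry-lookahead adder uses the textbook generate/propagate equations rather than the NOR-gate expressions of claim 3, the "low carry" and "high carry" of claim 6 are read as constant carry-ins of 0 and 1, and the function names are hypothetical.

```python
def cla4(a: int, b: int, carry_in: int) -> tuple:
    """4-bit carry-lookahead adder. Returns (4-bit sum, carry out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate: a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate: a_i XOR b_i
    c0 = carry_in & 1
    # Every carry is written directly in terms of g, p and c0 (no rippling).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = (c0, c1, c2, c3)
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i              # sum bit: p_i XOR c_i
    return total, c4


def carry_select_add(a: int, b: int, carry_in: int, width: int) -> tuple:
    """N-bit adder built from three N/2-bit sub-adders and two selectors."""
    if width not in (4, 8, 16, 32, 64):
        raise ValueError("width must be 4, 8, 16, 32 or 64")
    if width == 4:
        return cla4(a, b, carry_in)
    half = width // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, (a >> half) & mask
    b_lo, b_hi = b & mask, (b >> half) & mask
    # First sub-adder: low halves with the real input carry.
    lo_sum, lo_carry = carry_select_add(a_lo, b_lo, carry_in, half)
    # Second and third sub-adders: high halves, assuming carry-in 0 and 1.
    hi_sum0, hi_carry0 = carry_select_add(a_hi, b_hi, 0, half)
    hi_sum1, hi_carry1 = carry_select_add(a_hi, b_hi, 1, half)
    # The low-half carry drives both selectors.
    carry_out = hi_carry1 if lo_carry else hi_carry0   # first selector
    hi_sum = hi_sum1 if lo_carry else hi_sum0          # second selector
    return (hi_sum << half) | lo_sum, carry_out
```

The design choice being modelled is a speed-for-area trade: the high half is computed twice so that it does not have to wait for the carry from the low half, and the late-arriving carry only has to switch a selector.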
CN202310070128.4A 2023-02-07 2023-02-07 Vector fixed point ALU processing system Active CN115826910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310070128.4A CN115826910B (en) 2023-02-07 2023-02-07 Vector fixed point ALU processing system

Publications (2)

Publication Number Publication Date
CN115826910A (en) 2023-03-21
CN115826910B (en) 2023-05-02

Family

ID=85520863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310070128.4A Active CN115826910B (en) 2023-02-07 2023-02-07 Vector fixed point ALU processing system

Country Status (1)

Country Link
CN (1) CN115826910B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103885919A (en) * 2014-03-20 2014-06-25 北京航空航天大学 Multi-DSP and multi-FPGA parallel processing system and implement method
CN108369573A (en) * 2015-12-18 2018-08-03 英特尔公司 The instruction of operation for multiple vector elements to be arranged and logic
CN108459840A (en) * 2018-02-14 2018-08-28 中国科学院电子学研究所 A kind of SIMD architecture floating-point fusion point multiplication operation unit

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2409066B (en) * 2003-12-09 2006-09-27 Advanced Risc Mach Ltd A data processing apparatus and method for moving data between registers and memory
CN102750133B (en) * 2012-06-20 2014-07-30 中国电子科技集团公司第五十八研究所 32-Bit triple-emission digital signal processor supporting SIMD
US9513906B2 (en) * 2013-01-23 2016-12-06 International Business Machines Corporation Vector checksum instruction
CN105373367B (en) * 2015-10-29 2018-03-02 中国人民解放军国防科学技术大学 The vectorial SIMD operating structures for supporting mark vector to cooperate
US20170177352A1 (en) * 2015-12-18 2017-06-22 Intel Corporation Instructions and Logic for Lane-Based Strided Store Operations
US20170269931A1 (en) * 2016-03-16 2017-09-21 National Taiwan University Method and Computing System for Handling Instruction Execution Using Affine Register File on Graphic Processing Unit
CN107329936A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing neural network computing and matrix/vector computing
CN107608715B (en) * 2017-07-20 2020-07-03 上海寒武纪信息科技有限公司 Apparatus and method for performing artificial neural network forward operations
CN109388427A (en) * 2017-08-11 2019-02-26 龙芯中科技术有限公司 Vector processing method, vector processing unit and microprocessor
CN111913746B (en) * 2020-08-31 2022-08-19 中国人民解放军国防科技大学 Design method of low-overhead embedded processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant