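WO2022121090A1 - Processor supporting high-throughput multi-precision multiplication operations - Google Patents
Processor supporting high-throughput multi-precision multiplication operations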
- Publication number
- WO2022121090A1 (application PCT/CN2021/073517)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- precision
- register
- multiplication
- data
- general
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G06F7/487—Multiplying; Dividing
- G06F7/4876—Multiplying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30181—Instruction operation extension or modification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present invention relates to the technical field of general-purpose processors, in particular, to a RISC-V general-purpose processor that supports high-throughput multi-precision multiplication operations.
- to date, general-purpose processors have not added multi-precision support to their ordinary arithmetic and logic units.
- general-purpose processors still use 32-bit or 64-bit word widths in the design of their arithmetic circuits. The main reasons are: 1) operands in general-purpose workloads usually have different word widths, so, to preserve generality, a general-purpose processor cannot reduce the bit width of its internal arithmetic units to a low-precision width as readily as a neural network accelerator can; 2) to ensure backward compatibility, that is, so that the latest general-purpose processors can still run old program code, it is difficult to turn a general-purpose processor into a low-precision processor quickly.
- if a general-purpose processor is to use low-precision operations to accelerate certain applications while still implementing 32-bit or 64-bit general-purpose computation, it must be capable of multi-precision operation.
- among all the arithmetic circuits in a general-purpose processor, multipliers occupy the core position. Therefore, current general-purpose processors have the defect that they cannot handle multi-precision computation.
- the purpose of the present invention is to provide a RISC-V general-purpose processor that supports high-throughput multi-precision multiplication operations.
- a RISC-V general-purpose processor supporting high-throughput multi-precision multiplication includes an independent multiplier data path, which separates the data path of the multi-precision multiplier from the data paths of the other arithmetic units.
- multi-precision instructions can enter the register write-back stage directly after the execute stage without going through the memory-access stage, which reduces the use of pipeline registers and saves area and power.
- the multi-precision multiplier has an independent data path and can quickly write the result of a floating-point multiplication to the vectorized register file VRF.
- the general-purpose register file GRF mainly provides integer-type operations for integer operation instructions.
- a vectorized register file VRF is added. It is used to provide floating-point operands to floating-point arithmetic instructions and low-precision floating-point multiply instructions.
- the vectorized register file VRF is set as two sets of independent register files, each set of register files has a width of 128 bits and a depth of 16, and each has two read and one write ports.
- the registers of the first group, bank0, all have even address numbers, that is, from top to bottom v0, v2, v4…v30.
- the registers of the second group, bank1, all have odd address numbers, that is, from top to bottom v1, v3, v5…v31.
- the multiplication results of all precisions enter the register write-back stage after a fixed delay. If the awaited data is an FP16 multiplication result, it can be forwarded to the decode stage in the first multiplication cycle; if the required data is an FP32 or FP64 multiplication result, it can be forwarded to the decode stage only in the second or third multiplication cycle.
- vfmul.{precision} vrd, vrs1, vrs2 is the low-precision vector multiplication instruction; {precision} specifies the precision of the multiplication and has two options, single (FP32) and half (FP16);
- vfmadd.{precision} vrd, vrs1, vrs2, vrs3 is the low-precision vector multiply-accumulate instruction; vfmul.single performs 4 FP32 multiplications, and vfmul.half performs 16 FP16 multiplications;
- vld.{precision} vrd, rs1, imm is the vector load instruction, used to read consecutive data from memory into a vector register;
- vst.{precision} vrs1, rs2, imm stores the data in a vector register to memory; ldcvt.{dprec}{sprec} vrd, rs1, index converts the precision of the data in rs1 and stores it in a vector register.
- cvt.{dprec}{sprec} rd, rs1 converts the precision of the data in rs1 and stores it in an ordinary scalar register. broadcast.{width} vrd, rs1 replicates the data of rs1 and stores the copies in a vector register.
- the present invention has the following beneficial effects:
- a floating-point multiplier with three precisions FP64/FP32/FP16 is used as the basic multiplication unit, which can calculate one FP64 multiplication or four FP32 multiplication or 16 FP16 multiplications
- a microarchitecture for a multi-precision RISC-V processor is proposed to address the doubled operand bandwidth, the latency, and the data and structural conflicts that arise when computing low-precision multiplications, while still being able to perform conventional floating-point multiplication.
- Fig. 1 is the architecture diagram of the multi-precision RISC-V processor of the present invention.
- Fig. 2 is a schematic diagram of how the register file of the present invention provides operands.
- Fig. 3 is a schematic diagram of data forwarding for the multi-precision multiplier of the present invention.
- Fig. 4 is the forwarding detection circuit of the present invention.
- Fig. 5 shows the RISC-V multi-precision extension instructions of the present invention.
- the present invention provides a RISC-V general-purpose processor that supports high-throughput multi-precision multiplication.
- a general-purpose RISC-V processor micro-architecture design based on high-throughput multi-precision multipliers is proposed.
- the five pipeline stages are instruction fetch, decode, execute, memory access, and write-back.
- the specific micro-architectural innovations are as follows:
- the multi-precision instruction can enter the register write-back stage directly after the execute stage without going through the memory-access stage, which reduces the use of pipeline registers and saves area and power.
- if multi-precision multiplication instructions used the same data path as ordinary instructions, the latency of the multi-precision multiplier could affect the performance of Load/Store instructions, because other instructions can hide the latency through data forwarding, whereas a Load instruction obtains the desired data only after the memory-access stage.
- the present invention separates the data path of the multi-precision multiplier from the data paths of the other arithmetic units (such as the integer adder, the logical shifter, and the floating-point adder).
- the multi-precision multiplier has an independent data path, which can quickly write the result of a floating-point multiplication to the vectorized register file (VRF).
- when a processor's arithmetic unit has a fixed bit width w, the processor usually has a general-purpose register file with a depth of 32 and a bit width of w, with 2 read ports and 1 write port.
- when a multi-precision multiplier is used, an FP64 multiplication requires only 2 64-bit floating-point operands; an FP32 multiplication, since 4 FP32 multiplications can be computed at once, requires 8 32-bit operands, that is, 2 128-bit operands; an FP16 multiplication, since 16 FP16 multiplications can be computed at once, requires 32 16-bit operands, that is, 2 256-bit operands.
- the low-precision throughput is 4 times the high-precision throughput, so computing at low precision requires 2 times the operand bandwidth of computing at high precision. If three precisions are supported, the operand bandwidth at the lowest precision is 4 times the bandwidth at the highest precision.
- the general-purpose register file GRF on the left side of Figure 2 mainly provides integer-type operations for integer operation instructions.
- a vectorized register file VRF is added to provide floating-point operands to floating-point arithmetic instructions and low-precision floating-point multiplication instructions.
- the vectorized register file VRF is set up as two independent register files, each with a width of 128 bits and a depth of 16, and each with two read ports and one write port.
- the registers of the first group, bank0, all have even address numbers, that is, from top to bottom v0, v2, v4…v30.
- the registers of the second group, bank1, all have odd address numbers, that is, from top to bottom v1, v3, v5…v31.
- when the instruction is an FP64 multiplication instruction, the two 64-bit floating-point operands can come from any two of the 32 vector registers, either in the same group (because each group has two register read ports) or in different groups; since only two 64-bit operands are required, only the lower 64 bits of the two registers are read.
- when the instruction is a low-precision multiplication instruction computing FP32, the two 128-bit floating-point operands can likewise come from any two of the 32 vector registers.
- when the instruction is a low-precision multiplication instruction computing FP16, the two 256-bit floating-point operands must come from four 128-bit floating-point registers.
- because each group of registers has only two register read ports, each group must supply two 128-bit operands. And because the RISC-V instruction encoding format is limited, it cannot accommodate 4 source-operand register addresses and 2 destination-operand register addresses. Therefore, in the present invention, the source and destination register addresses of an FP16 low-precision multiplication instruction are forced to be even register numbers. When the hardware detects that the opcode of the instruction is an FP16 multiplication, it reads the data of the source register rs1 named in the instruction together with the data of the odd-numbered register in the same row and packs the two into one 256-bit operand; it likewise reads the data of the source register rs2 together with the odd-numbered register in the same row and packs the two into the other 256-bit operand.
- the multi-precision multiplier has a different latency at each precision: the result of an FP16 multiplication is available after one clock cycle, the result of an FP32 multiplication after two cycles, and the result of an FP64 multiplication after three cycles.
- the variable latency of the multiplier unit may lead to more data conflicts and structural conflicts.
- when an FP32 multiplication instruction is immediately followed by an FP16 instruction, the multiplication results of the two instructions become valid at the same time; if both are submitted to the register write-back stage simultaneously, a structural conflict results.
- the multiplication results of all precisions enter the register write-back stage after a fixed delay, which avoids the structural conflict of simultaneous submission;
- when a read-after-write data conflict occurs, if the awaited data is an FP16 multiplication result, it can be forwarded to the decode stage in the first multiplication cycle. If the required data is an FP32 or FP64 multiplication result, it can be forwarded to the decode stage only in the second or third multiplication cycle.
- the specific forwarding scheduling circuit is shown in Figure 4.
- first, check whether the source register vrs1 or vrs2 of the decode stage is the same as the destination register of the M1 stage: if it is and the opcode of M1 is FP32 or FP64, the pipeline is stalled; if it is and the opcode of M1 is FP16, the multiplication result of M1 is forwarded directly to the decode stage. Then check whether vrs1 or vrs2 is the same as the destination register of the M2 stage: if it is and the opcode of M2 is FP64, the pipeline is stalled; if it is and the opcode of M2 is FP32, the multiplication result of M2 is forwarded directly to the decode stage. Finally, check whether vrs1 or vrs2 is the same as the destination register of the M3 stage: if it is and the opcode of M3 is FP64, the multiplication result of M3 is forwarded directly to the decode stage.
- vfmul.{precision} vrd, vrs1, vrs2 is the low-precision vector multiplication instruction; {precision} specifies the precision of the multiplication and has two options, single (FP32) and half (FP16); vfmadd.{precision} vrd, vrs1, vrs2, vrs3 is the low-precision vector multiply-accumulate instruction; vfmul.single performs 4 FP32 multiplications and vfmul.half performs 16 FP16 multiplications; vld.{precision} vrd, rs1, imm is the vector load instruction, used to read consecutive data from memory into a vector register; vst.{precision} vrs1, rs2, imm stores the data in a vector register to memory; ldcvt.{dprec}{sprec} vrd, rs1, index converts the precision of the data in rs1 and stores it in a vector register.
- cvt.{dprec}{sprec} rd, rs1 converts the precision of the data in rs1 and stores it in an ordinary scalar register. broadcast.{width} vrd, rs1 replicates the data of rs1 and stores the copies in a vector register.
- besides implementing the system provided by the present invention and its devices, modules, and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system provided by the present invention and its devices, modules, and units can be regarded as a hardware component, and the devices, modules, and units it contains for realizing various functions can be regarded as structures within that hardware component.
- the devices, modules, and units for realizing various functions can also be regarded both as software modules implementing the method and as structures within a hardware component.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computing Systems (AREA)
- Nonlinear Science (AREA)
- Complex Calculations (AREA)
Abstract
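The present invention provides a RISC-V general-purpose processor supporting high-throughput multi-precision multiplication. The processor includes an independent multiplier data path that separates the data path of the multi-precision multiplier from the data paths of the other arithmetic units. Multi-precision instructions enter the register write-back stage directly after the execute stage without passing through the memory-access stage, which reduces the use of pipeline registers and saves area and power. The multi-precision multiplier has an independent data path through which it writes the results of floating-point multiplications to the vectorized register file VRF. The RISC-V general-purpose processor provided by the invention can handle multi-precision computing demands efficiently.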
Description
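The present invention relates to the technical field of general-purpose processors and, in particular, to a RISC-V general-purpose processor supporting high-throughput multi-precision multiplication.
Since the Intel 80386, using 32-bit or 64-bit words in general-purpose processors has become the conventional approach, and it is treated as such in today's arithmetic logic unit (ALU), architecture, and algorithm design. The popularity of deep neural networks has made neural-network acceleration a new design direction: quantizing and compressing a network yields weight data with a smaller bit width, which reduces both the demand for compute and the memory-bandwidth overhead. For example, Google's TPU supports the low-precision BF16 floating-point format, and NVIDIA has added TensorCore cores for multi-precision computing to its latest GPUs to accelerate general matrix multiplication.
So far, general-purpose processors have not added multi-precision support to their ordinary arithmetic and logic units. Their arithmetic circuits still use 32-bit or 64-bit word widths, mainly for two reasons: 1) operands in general-purpose workloads usually have different word widths, so, to preserve generality, a general-purpose processor cannot reduce the bit width of its internal arithmetic units to a low-precision width as readily as a neural-network accelerator can; 2) to guarantee backward compatibility, that is, so that the latest general-purpose processors can still run old program code, it is difficult to turn a general-purpose processor into a low-precision processor quickly.
Therefore, if a general-purpose processor is to accelerate certain applications with low-precision operations while still supporting 32-bit or 64-bit general-purpose computation, it must be capable of multi-precision operation. Among all the arithmetic circuits in a general-purpose processor, the multiplier occupies the core position; current general-purpose processors therefore have the defect that they cannot handle multi-precision computation.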
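Summary of the Invention
In view of the defects in the prior art, the purpose of the present invention is to provide a RISC-V general-purpose processor supporting high-throughput multi-precision multiplication.
A RISC-V general-purpose processor supporting high-throughput multi-precision multiplication according to the present invention includes an independent multiplier data path that separates the data path of the multi-precision multiplier from the data paths of the other arithmetic units. Multi-precision instructions can enter the register write-back stage directly after the execute stage without passing through the memory-access stage, which reduces the use of pipeline registers and saves area and power. The multi-precision multiplier has an independent data path and can quickly write the results of floating-point multiplications to the vectorized register file VRF.
Preferably, the processor further includes a register file that combines general-purpose registers and vector registers. The general-purpose register file GRF mainly supplies integer-type operands to integer arithmetic instructions; on top of it, a vectorized register file VRF is added to supply floating-point operands to floating-point arithmetic instructions and low-precision floating-point multiplication instructions.
Preferably, the vectorized register file VRF is set up as two independent groups of registers, each with a width of 128 bits and a depth of 16, and each with two read ports and one write port. The registers of the first group, bank0, all have even address numbers, that is, from top to bottom v0, v2, v4…v30; the registers of the second group, bank1, all have odd address numbers, that is, from top to bottom v1, v3, v5…v31.
Preferably, the multiplication results of all precisions enter the register write-back stage after a fixed delay. If the awaited data is an FP16 multiplication result, it can be forwarded to the decode stage in the first multiplication cycle; if the required data is an FP32 or FP64 multiplication result, it can be forwarded to the decode stage only in the second or third multiplication cycle.
Preferably, the processor provides extension instructions that perform low-precision multiplication in SIMD form. vfmul.{precision} vrd, vrs1, vrs2 is the low-precision vector multiplication instruction; {precision} specifies the precision of the multiplication and has two options, single (FP32) and half (FP16). vfmadd.{precision} vrd, vrs1, vrs2, vrs3 is the low-precision vector multiply-accumulate instruction. vfmul.single performs 4 FP32 multiplications and vfmul.half performs 16 FP16 multiplications. vld.{precision} vrd, rs1, imm is the vector load instruction, which reads consecutive data from memory into a vector register; vst.{precision} vrs1, rs2, imm stores the data in a vector register to memory; ldcvt.{dprec}{sprec} vrd, rs1, index converts the precision of the data in rs1 and stores it in a vector register; cvt.{dprec}{sprec} rd, rs1 converts the precision of the data in rs1 and stores it in an ordinary scalar register; broadcast.{width} vrd, rs1 replicates the data of rs1 and stores the copies in a vector register.
Compared with the prior art, the present invention has the following beneficial effects: it uses a floating-point multiplier supporting three precisions, FP64/FP32/FP16, as the basic multiplication unit, which can compute one FP64 multiplication, 4 FP32 multiplications, or 16 FP16 multiplications, and it proposes a microarchitecture for a multi-precision RISC-V processor that addresses the doubled operand bandwidth, the latency, and the data and structural conflicts that arise when computing low-precision multiplications, while still performing conventional floating-point multiplication.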
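Other features, objects, and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is the architecture diagram of the multi-precision RISC-V processor of the present invention;
Fig. 2 is a schematic diagram of how the register file of the present invention provides operands;
Fig. 3 is a schematic diagram of data forwarding for the multi-precision multiplier of the present invention;
Fig. 4 is the forwarding detection circuit of the present invention;
Fig. 5 shows the RISC-V multi-precision extension instructions of the present invention.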
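The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to understand the invention further, but they do not limit it in any form. It should be noted that those of ordinary skill in the art can make a number of changes and improvements without departing from the concept of the invention, all of which fall within its scope of protection.
The RISC-V general-purpose processor supporting high-throughput multi-precision multiplication provided by the present invention, as shown in Fig. 1, is a general-purpose RISC-V processor micro-architecture built around a high-throughput multi-precision multiplier. It has a basic five-stage pipeline: instruction fetch, decode, execute, memory access, and write-back. The specific micro-architectural innovations are as follows:
1. Independent multiplier data path
Because instructions that use the multi-precision multiplier do not involve storing data to memory, a multi-precision instruction can enter the register write-back stage directly after the execute stage without passing through the memory-access stage, which reduces the use of pipeline registers and saves area and power. In addition, if multi-precision multiplication instructions used the same data path as ordinary instructions, the latency of the multi-precision multiplier could affect the performance of Load/Store instructions: other instructions can hide the latency through data forwarding, whereas a Load instruction obtains the desired data only after the memory-access stage.
Moreover, because multi-precision multipliers in domain-specific computation usually target high throughput, the present invention separates the data path of the multi-precision multiplier from the data paths of the other arithmetic units (such as the integer adder, the logical shifter, and the floating-point adder). As shown in Fig. 1, the multi-precision multiplier has an independent data path and can quickly write the results of floating-point multiplications to the vectorized register file (VRF).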
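2. Register file design
In a conventional design, when a processor's arithmetic units have a fixed bit width w, the processor usually has a general-purpose register file with a depth of 32 and a bit width of w, with 2 read ports and 1 write port. With a multi-precision multiplier, an FP64 multiplication needs only 2 64-bit floating-point operands; an FP32 multiplication, since 4 FP32 multiplications can be computed at once, needs 8 32-bit operands, that is, 2 128-bit operands; and an FP16 multiplication, since 16 FP16 multiplications can be computed at once, needs 32 16-bit operands, that is, 2 256-bit operands. It follows that, with a multi-precision multiplier, each step down in precision raises throughput 4-fold and doubles the required operand bandwidth; when three precisions are supported, the operand bandwidth at the lowest precision is 4 times the bandwidth at the highest precision.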
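The operand-width arithmetic in this paragraph can be checked with a short sketch. The lane counts and element widths come from the description; the dictionary and function names are illustrative assumptions, not part of the patent.

```python
# Lanes computed per multiply instruction and element width in bits,
# as stated in the description. All names here are illustrative.
LANES = {"FP64": 1, "FP32": 4, "FP16": 16}
ELEM_BITS = {"FP64": 64, "FP32": 32, "FP16": 16}


def operand_width_bits(precision):
    """Width of ONE packed source operand for the given precision."""
    return LANES[precision] * ELEM_BITS[precision]


def relative_bandwidth(precision, baseline="FP64"):
    """Operand bandwidth relative to the baseline precision."""
    return operand_width_bits(precision) // operand_width_bits(baseline)


for p in ("FP64", "FP32", "FP16"):
    print(p, operand_width_bits(p), relative_bandwidth(p))
```

Each precision step multiplies the lane count by 4 but halves the element width, which is why the operand width only doubles per step.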
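To address the different bandwidth requirements of the different precisions, the present invention uses the register file structure shown in Fig. 2. The general-purpose register file GRF on the left side of Fig. 2 mainly supplies integer-type operands to integer arithmetic instructions; on top of it, a vectorized register file VRF is added to supply floating-point operands to floating-point arithmetic instructions and low-precision floating-point multiplication instructions.
The vectorized register file VRF is set up as two independent groups of registers, each with a width of 128 bits and a depth of 16, and each with two read ports and one write port. The registers of the first group, bank0, all have even address numbers, that is, from top to bottom v0, v2, v4…v30; the registers of the second group, bank1, all have odd address numbers, that is, from top to bottom v1, v3, v5…v31.
When the instruction is an FP64 multiplication instruction, the two 64-bit floating-point operands can come from any two of the 32 vector registers, either in the same group (because each group has two register read ports) or in different groups; since only two 64-bit operands are required, only the lower 64 bits of the two registers are read. When the instruction is a low-precision multiplication instruction computing FP32, the two 128-bit floating-point operands can likewise come from any two of the 32 vector registers. When the instruction is a low-precision multiplication instruction computing FP16, the two 256-bit floating-point operands must come from four 128-bit floating-point registers; because each group has only two register read ports, each group must supply two 128-bit operands. Furthermore, because the RISC-V instruction encoding format is limited, it cannot accommodate 4 source-operand register addresses and 2 destination-operand register addresses. Therefore, in the present invention, the source and destination register addresses of an FP16 low-precision multiplication instruction are forced to be even register numbers. When the hardware detects that the opcode of the instruction is an FP16 multiplication, it reads the data of the source register rs1 named in the instruction together with the data of the odd-numbered register in the same row and packs the two into one 256-bit operand; it likewise reads the data of the source register rs2 together with the odd-numbered register in the same row and packs the two into the other 256-bit operand.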
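A minimal sketch of the bank mapping and FP16 operand pairing described above, modelling each 128-bit register as a Python integer. The function names are hypothetical, and placing the even register in the low 128 bits of the packed operand is an assumption; the description does not state the packing order.

```python
def bank_and_row(reg):
    """Map vector register number v0..v31 to (bank, row).

    Even registers live in bank0 and odd registers in bank1;
    each bank holds 16 rows.
    """
    if not 0 <= reg < 32:
        raise ValueError("vector register must be in v0..v31")
    return reg % 2, reg // 2


def fp16_register_pair(reg):
    """Return the (even, odd) registers on the same row for an FP16 operand."""
    if reg % 2 != 0:
        raise ValueError("FP16 multiply operands must name an even register")
    return reg, reg + 1


def read_fp16_operand(vrf, reg):
    """Pack two 128-bit registers into one 256-bit operand.

    Assumption: the even register supplies the low 128 bits.
    """
    even, odd = fp16_register_pair(reg)
    return (vrf[odd] << 128) | vrf[even]
```

Because the even and odd halves of a pair sit in different banks, the two 256-bit source operands of one FP16 multiply use exactly two read ports in each bank.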
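3. Multi-precision instruction scheduling
The multi-precision multiplier has a different latency at each precision: the result of an FP16 multiplication is available after one clock cycle, the result of an FP32 multiplication after two cycles, and the result of an FP64 multiplication after three cycles. The variable latency of the multiplier unit can cause more data conflicts and structural conflicts. When an FP32 multiplication instruction is immediately followed by an FP16 instruction, the multiplication results of the two instructions become valid at the same time; if both are submitted to the register write-back stage simultaneously, a structural conflict results. The present invention adopts the multiplier data forwarding circuit shown in Fig. 3: the multiplication results of all precisions enter the register write-back stage after a fixed delay, which avoids the structural conflict of simultaneous submission. When a read-after-write data conflict occurs, if the awaited data is an FP16 multiplication result, it can be forwarded to the decode stage in the first multiplication cycle; if the required data is an FP32 or FP64 multiplication result, it can be forwarded to the decode stage only in the second or third multiplication cycle.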
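The structural conflict and its fix can be illustrated with a small cycle-count model. The per-precision latencies are taken from the description; setting the fixed delay to the longest latency (3 cycles) is an assumption, since the description only says the delay is fixed.

```python
# Cycles until the multiplier result is valid, per the description.
LATENCY = {"FP16": 1, "FP32": 2, "FP64": 3}
# Assumption: the fixed write-back delay equals the longest latency.
FIXED_DELAY = max(LATENCY.values())


def writeback_cycles(instrs, fixed):
    """Write-back cycle of each (issue_cycle, precision) instruction."""
    return [
        issue + (FIXED_DELAY if fixed else LATENCY[precision])
        for issue, precision in instrs
    ]


def has_port_conflict(cycles):
    """True if two results reach the single write port in the same cycle."""
    return len(set(cycles)) != len(cycles)


# An FP32 multiply immediately followed by an FP16 multiply.
program = [(0, "FP32"), (1, "FP16")]
print(writeback_cycles(program, fixed=False))  # both results valid in cycle 2
print(writeback_cycles(program, fixed=True))   # write-backs stay in issue order
```

With at most one multiply issued per cycle, a fixed delay keeps write-back cycles distinct by construction; the cost is that low-precision results wait longer to retire, which the forwarding path compensates for.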
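The specific forwarding scheduling circuit is shown in Fig. 4. First, it checks whether the source register vrs1 or vrs2 of the decode stage is the same as the destination register of the M1 stage: if it is and the opcode of M1 is FP32 or FP64, the pipeline is stalled; if it is and the opcode of M1 is FP16, the multiplication result of M1 is forwarded directly to the decode stage. Next, it checks whether vrs1 or vrs2 is the same as the destination register of the M2 stage: if it is and the opcode of M2 is FP64, the pipeline is stalled; if it is and the opcode of M2 is FP32, the multiplication result of M2 is forwarded directly to the decode stage. Finally, it checks whether vrs1 or vrs2 is the same as the destination register of the M3 stage: if it is and the opcode of M3 is FP64, the multiplication result of M3 is forwarded directly to the decode stage.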
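The three checks can be written as a small decision function. This is a behavioural sketch only: the function name, the tuple encoding of a pipeline stage, and the returned labels are assumptions. Cases the description does not mention (for example an FP16 result that has already moved on to M2) are reported as "none" rather than guessed.

```python
def forwarding_action(src_regs, m1=None, m2=None, m3=None):
    """Decide what to do for the instruction in decode.

    src_regs: set of source register numbers (vrs1, vrs2).
    m1, m2, m3: None, or a (dest_reg, precision) tuple for the multiply
    currently in that stage.
    Returns ("stall", stage), ("forward", stage) or ("none", None).
    """
    # (stage name, contents, precisions whose result is ready here,
    #  precisions that still need more cycles and force a stall)
    rules = (
        ("M1", m1, {"FP16"}, {"FP32", "FP64"}),
        ("M2", m2, {"FP32"}, {"FP64"}),
        ("M3", m3, {"FP64"}, set()),
    )
    for name, stage, ready, not_ready in rules:
        if stage is None:
            continue
        dest, precision = stage
        if dest not in src_regs:
            continue
        if precision in not_ready:
            return ("stall", name)
        if precision in ready:
            return ("forward", name)
    return ("none", None)
```

Checking M1 before M2 before M3 follows the order given in the description and also means the youngest producer of a register wins when several stages match.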
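4. RISC-V multi-precision extension instructions
The present invention proposes extension instructions that perform low-precision multiplication in SIMD form. As shown in Fig. 5, vfmul.{precision} vrd, vrs1, vrs2 is the low-precision vector multiplication instruction; {precision} specifies the precision of the multiplication and has two options, single (FP32) and half (FP16). vfmadd.{precision} vrd, vrs1, vrs2, vrs3 is the low-precision vector multiply-accumulate instruction. vfmul.single performs 4 FP32 multiplications and vfmul.half performs 16 FP16 multiplications. vld.{precision} vrd, rs1, imm is the vector load instruction, which reads consecutive data from memory into a vector register; vst.{precision} vrs1, rs2, imm stores the data in a vector register to memory; ldcvt.{dprec}{sprec} vrd, rs1, index converts the precision of the data in rs1 and stores it in a vector register; cvt.{dprec}{sprec} rd, rs1 converts the precision of the data in rs1 and stores it in an ordinary scalar register; broadcast.{width} vrd, rs1 replicates the data of rs1 and stores the copies in a vector register.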
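As a rough software model of the lane-wise behaviour of vfmul.single and vfmul.half, the sketch below treats each packed operand as a byte string and uses Python's `struct` module for the FP32 (`f`) and FP16 (`e`) encodings. The function name, the little-endian lane order, and the use of byte strings are assumptions for illustration; rounding and exception behaviour of the real hardware are not specified in the description.

```python
import struct

# (struct format character, lanes per instruction) for each precision option.
# Lane counts follow the description; lane order is an assumption.
FORMATS = {"single": ("f", 4), "half": ("e", 16)}


def vfmul(precision, vrs1, vrs2):
    """Lane-wise multiply of two packed operands given as bytes.

    "single" expects 16-byte (128-bit) operands, "half" expects
    32-byte (256-bit) operands. struct.pack raises OverflowError if a
    product does not fit the target format.
    """
    code, lanes = FORMATS[precision]
    fmt = f"<{lanes}{code}"
    a = struct.unpack(fmt, vrs1)
    b = struct.unpack(fmt, vrs2)
    return struct.pack(fmt, *(x * y for x, y in zip(a, b)))


a = struct.pack("<4f", 1.5, 2.0, -3.0, 0.5)
b = struct.pack("<4f", 2.0, 2.0, 2.0, 2.0)
print(struct.unpack("<4f", vfmul("single", a, b)))
```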
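Those skilled in the art know that, besides implementing the system provided by the present invention and its devices, modules, and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system provided by the present invention and its devices, modules, and units can be regarded as a hardware component, and the devices, modules, and units it contains for realizing various functions can be regarded as structures within that hardware component; the devices, modules, and units for realizing various functions can also be regarded both as software modules implementing the method and as structures within a hardware component.
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to these specific embodiments; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the invention. Where no conflict arises, the embodiments of the present application and the features of the embodiments can be combined with one another arbitrarily.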
Claims (5)
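- A RISC-V general-purpose processor supporting high-throughput multi-precision multiplication, characterized in that it includes an independent multiplier data path, the multiplier data path separating the data path of the multi-precision multiplier from the data paths of the other arithmetic units; multi-precision instructions enter the register write-back stage directly after the execute stage without passing through the memory-access stage, reducing the use of pipeline registers and saving area and power; and the multi-precision multiplier has an independent data path through which the results of floating-point multiplications are written to the vectorized register file VRF.
- The RISC-V general-purpose processor supporting high-throughput multi-precision multiplication according to claim 1, characterized in that it further includes a register file combining general-purpose registers and vector registers; the general-purpose register file GRF mainly supplies integer-type operands to integer arithmetic instructions, and, on top of the general-purpose register file, a vectorized register file VRF is added to supply floating-point operands to floating-point arithmetic instructions and low-precision floating-point multiplication instructions.
- The RISC-V general-purpose processor supporting high-throughput multi-precision multiplication according to claim 2, characterized in that the vectorized register file VRF is set up as two independent groups of registers, each with a width of 128 bits and a depth of 16, and each with two read ports and one write port; the registers of the first group, bank0, all have even address numbers, from top to bottom v0, v2, v4…v30, and the registers of the second group, bank1, all have odd address numbers, from top to bottom v1, v3, v5…v31.
- The RISC-V general-purpose processor supporting high-throughput multi-precision multiplication according to claim 1, characterized in that the multiplication results of all precisions enter the register write-back stage after a fixed delay; if the awaited data is an FP16 multiplication result, the result data is forwarded to the decode stage in the first multiplication cycle; if the required data is an FP32 or FP64 multiplication result, the multiplication result can be forwarded to the decode stage only in the second or third multiplication cycle.
- The RISC-V general-purpose processor supporting high-throughput multi-precision multiplication according to claim 1, characterized by extension instructions that perform low-precision multiplication in SIMD form; vfmul.{precision} vrd, vrs1, vrs2 is the low-precision vector multiplication instruction, where {precision} specifies the precision of the multiplication and has two options, single (FP32) and half (FP16); vfmadd.{precision} vrd, vrs1, vrs2, vrs3 is the low-precision vector multiply-accumulate instruction; vfmul.single performs 4 FP32 multiplications and vfmul.half performs 16 FP16 multiplications; vld.{precision} vrd, rs1, imm is the vector load instruction, used to read consecutive data from memory into a vector register; vst.{precision} vrs1, rs2, imm stores the data in a vector register to memory; ldcvt.{dprec}{sprec} vrd, rs1, index converts the precision of the data in rs1 and stores it in a vector register; cvt.{dprec}{sprec} rd, rs1 converts the precision of the data in rs1 and stores it in an ordinary scalar register; broadcast.{width} vrd, rs1 replicates the data of rs1 and stores the copies in a vector register.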
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011424890.0 | 2020-12-09 | | |
CN202011424890.0A CN112506468B (zh) | 2020-12-09 | 2020-12-09 | RISC-V general-purpose processor supporting high-throughput multi-precision multiplication operations |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022121090A1 (zh) | 2022-06-16 |
Family
ID=74971549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/073517 WO2022121090A1 (zh) | Processor supporting high-throughput multi-precision multiplication operations | 2020-12-09 | 2021-01-25 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112506468B (zh) |
WO (1) | WO2022121090A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
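CN113722669B (zh) * | 2021-11-03 | 2022-01-21 | 海光信息技术股份有限公司 | Data processing method, apparatus, device and storage medium |
CN114117896B (zh) * | 2021-11-09 | 2024-07-26 | 上海交通大学 | Method and system for optimized implementation of binary reduction for ultra-long SIMD pipelines |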
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
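CN101876892A (zh) * | 2010-05-20 | 2010-11-03 | 复旦大学 | Single-instruction multiple-data processor circuit structure for communication and multimedia applications |
CN102184092A (zh) * | 2011-05-04 | 2011-09-14 | 西安电子科技大学 | Application-specific instruction set processor based on a pipeline structure |
US20140188968A1 (en) * | 2012-12-28 | 2014-07-03 | Himanshu Kaul | Variable precision floating point multiply-add circuit |
CN105608051A (zh) * | 2014-11-14 | 2016-05-25 | 凯为公司 | Implementing 128-bit SIMD operations on a 64-bit data path |
CN109918130A (zh) * | 2019-01-24 | 2019-06-21 | 中山大学 | Four-stage pipelined RISC-V processor with a fast data bypass structure |
CN110928832A (zh) * | 2019-10-09 | 2020-03-27 | 中山大学 | Asynchronous pipeline processor circuit, device and data processing method |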
Family Cites Families (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5673407A (en) * | 1994-03-08 | 1997-09-30 | Texas Instruments Incorporated | Data processor having capability to perform both floating point operations and memory access in response to a single instruction |
WO1998006030A1 (en) * | 1996-08-07 | 1998-02-12 | Sun Microsystems | Multifunctional execution unit |
WO2002084451A2 (en) * | 2001-02-06 | 2002-10-24 | Victor Demjanenko | Vector processor architecture and methods performed therein |
FR2839224B1 (fr) * | 2002-04-30 | 2007-05-04 | Gemplus Card Int | Method for performing a multi-precision modular multiplication phase on two operands, and cryptoprocessor for implementing the method |
CN1259617C (zh) * | 2003-09-09 | 2006-06-14 | 大唐微电子技术有限公司 | Method for accelerating the RSA encryption/decryption process, and modular multiplication and modular exponentiation circuits therefor |
KR20050088506A (ko) * | 2004-03-02 | 2005-09-07 | 삼성전자주식회사 | Scalable Montgomery modular multiplier supporting multiple precision |
CN100461095C (zh) * | 2007-11-20 | 2009-02-11 | 浙江大学 | Design method for a media-enhanced pipelined multiplication unit supporting multiple modes |
CN101894096A (zh) * | 2010-06-24 | 2010-11-24 | 复旦大学 | FFT computation circuit structure suitable for CMMB and DVB-H/T |
CN101916180B (zh) * | 2010-08-11 | 2013-05-29 | 中国科学院计算技术研究所 | Method and system for executing register-type instructions in a RISC processor |
CN104115115B (zh) * | 2011-12-19 | 2017-06-13 | 英特尔公司 | SIMD integer multiply-accumulate instruction for multi-precision arithmetic |
US9292297B2 (en) * | 2012-09-14 | 2016-03-22 | Intel Corporation | Method and apparatus to process 4-operand SIMD integer multiply-accumulate instruction |
CN104767544B (zh) * | 2014-01-02 | 2018-08-24 | 深圳市中兴微电子技术有限公司 | Method for implementing descrambling and despreading, and vector arithmetic unit |
CN104156195B (zh) * | 2014-08-19 | 2016-08-24 | 中国航天科技集团公司第九研究院第七七一研究所 | System and method for integrating an extended double-precision 80-bit floating-point processing unit into a processor |
CN105045560A (zh) * | 2015-08-25 | 2015-11-11 | 浪潮(北京)电子信息产业有限公司 | Fixed-point multiply-add operation method and device |
CN105335127A (zh) * | 2015-10-29 | 2016-02-17 | 中国人民解放军国防科学技术大学 | Scalar arithmetic unit structure supporting floating-point division in a GPDSP |
US20190073337A1 (en) * | 2017-09-05 | 2019-03-07 | Mediatek Singapore Pte. Ltd. | Apparatuses capable of providing composite instructions in the instruction set architecture of a processor |
US10867239B2 (en) * | 2017-12-29 | 2020-12-15 | Spero Devices, Inc. | Digital architecture supporting analog co-processor |
US11093579B2 (en) * | 2018-09-05 | 2021-08-17 | Intel Corporation | FP16-S7E8 mixed precision for deep learning and other algorithms |
CN109634558B (zh) * | 2018-12-12 | 2020-01-14 | 上海燧原科技有限公司 | Programmable mixed-precision arithmetic unit |
FR3090932B1 (fr) * | 2018-12-20 | 2022-05-27 | Kalray | Block-wise matrix multiplication system |
CN110221808B (zh) * | 2019-06-03 | 2020-10-09 | 深圳芯英科技有限公司 | Preprocessing method for vector multiply-add operations, multiply-adder, and computer-readable medium |
-
2020
- 2020-12-09 CN CN202011424890.0A patent/CN112506468B/zh active Active
-
2021
- 2021-01-25 WO PCT/CN2021/073517 patent/WO2022121090A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN112506468B (zh) | 2023-04-28 |
CN112506468A (zh) | 2021-03-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9778911B2 (en) | Reducing power consumption in a fused multiply-add (FMA) unit of a processor | |
US10372668B2 (en) | Hardware processors and methods for tightly-coupled heterogeneous computing | |
US8880855B2 (en) | Dual register data path architecture with registers in a data file divided into groups and sub-groups | |
CN112099852A (zh) | Variable-format, variable-sparsity matrix multiplication instructions | |
US8918445B2 (en) | Circuit which performs split precision, signed/unsigned, fixed and floating point, real and complex multiplication | |
US6349319B1 (en) | Floating point square root and reciprocal square root computation unit in a processor | |
US20120166511A1 (en) | System, apparatus, and method for improved efficiency of execution in signal processing algorithms | |
US9235414B2 (en) | SIMD integer multiply-accumulate instruction for multi-precision arithmetic | |
US9639369B2 (en) | Split register file for operands of different sizes | |
US10275247B2 (en) | Apparatuses and methods to accelerate vector multiplication of vector elements having matching indices | |
US6671796B1 (en) | Converting an arbitrary fixed point value to a floating point value | |
CN107918546B (zh) | Processors, methods, and systems to implement partial register accesses with masked full register accesses | |
US20130339649A1 (en) | Single instruction multiple data (simd) reconfigurable vector register file and permutation unit | |
US6463525B1 (en) | Merging single precision floating point operands | |
US11474825B2 (en) | Apparatus and method for controlling complex multiply-accumulate circuitry | |
WO2022121090A1 (zh) | Processor supporting high-throughput multi-precision multiplication operations | |
US6341300B1 (en) | Parallel fixed point square root and reciprocal square root computation unit in a processor | |
US7117342B2 (en) | Implicitly derived register specifiers in a processor | |
US7587582B1 (en) | Method and apparatus for parallel arithmetic operations | |
KR100636596B1 (ko) | High energy-efficiency parallel-processing data path structure | |
WO2002015000A2 (en) | General purpose processor with graphics/media support | |
JP5786719B2 (ja) | Vector processor | |
US11782719B2 (en) | Reconfigurable multi-thread processor for simultaneous operations on split instructions and operands | |
US20230094414A1 (en) | Matrix operation with multiple tiles per matrix dimension | |
CN116339826A (zh) | Apparatus and method for vector packed concatenate and shift of specific portions of quadwords |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21901809; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 21901809; Country of ref document: EP; Kind code of ref document: A1 |