CN117634569A - Quantized neural network acceleration processor based on RISC-V expansion instruction - Google Patents

Quantized neural network acceleration processor based on RISC-V expansion instruction

Info

Publication number
CN117634569A
CN117634569A
Authority
CN
China
Prior art keywords
module
instruction
bit
neural network
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311581806.XA
Other languages
Chinese (zh)
Other versions
CN117634569B (en)
Inventor
黄科杰
刘佳沂
沈海斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311581806.XA priority Critical patent/CN117634569B/en
Priority claimed from CN202311581806.XA external-priority patent/CN117634569B/en
Publication of CN117634569A publication Critical patent/CN117634569A/en
Application granted granted Critical
Publication of CN117634569B publication Critical patent/CN117634569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a quantized neural network acceleration processor based on RISC-V extension instructions. The processor adopts a four-stage pipeline consisting of instruction fetch, decode, execute, and write-back stages. It supports a custom extended instruction set with high data bit widths, which improves computational parallelism and accelerates operation. Correspondingly, the processor internally provides three data paths of different bit widths, with matching register sets, to support grouped quantization and parallel computation of the neural network. Fast computation of convolutional layers and activation functions is achieved through a computing unit supporting the extended instruction set and a lookup table, while dynamic fixed-point computation improves the adaptability and computational precision of the group-quantized neural network. The invention has the advantages of high computational efficiency, low energy consumption, and a wide range of applications.

Description

Quantized neural network acceleration processor based on RISC-V expansion instruction
Technical Field
The invention belongs to the technical fields of neural network hardware acceleration and RISC-V instruction set extension processors, and particularly relates to a quantized neural network acceleration processor based on RISC-V extension instructions.
Background
With the further development of the Internet of Things, applications such as real-time positioning, real-time environment detection, secure data acquisition, timely reporting of large-scale sensor data, and intelligent manufacturing impose strict real-time requirements and large data transmission volumes. All of them require the edge to provide, at controllable cost, larger communication bandwidth, stable communication capability, acceleration of complex algorithms, and large data processing capability.
However, the resources of Internet of Things terminal devices are limited, and the computing resources of their on-board processors are usually scarce. IoT hardware has become a bottleneck restricting the development and deployment of AIoT at the edge, so the development of a low-power, low-cost, small dedicated processor for the AIoT edge is urgent.
Disclosure of Invention
The invention extends parallel instructions on the basis of the RISC-V instruction set and designs a low-power, low-cost, small dedicated processor for the AIoT edge. The processor features high energy efficiency, high speed, and low resource occupation, and can support various data processing algorithms on the Internet of Things end side, thereby removing the constraint that the limited resources and performance of IoT terminal devices place on AIoT development.
To achieve the above object, the present invention provides a quantized neural network acceleration processor based on RISC-V extension instructions, comprising: an instruction fetch module, a decode module, an execution module, a write-back module, a data path module, and a controller;
the instruction fetch module acquires, from an external bus, instructions converted from the external neural network model and judges whether each acquired instruction is a compressed instruction; if not, the instruction is transmitted directly to the decode module, and if so, the compressed instruction is decompressed and then transmitted to the decode module;
the decode module comprises a decoder and a register set; the decoder decodes the instructions transmitted by the instruction fetch module to obtain instruction control signals, and the register set is addressed by the instruction control signals generated by the decoder and stores the write-back data transmitted by the write-back module;
the execution module comprises an arithmetic execution module and a status register; the arithmetic execution module performs specific computations or accesses the register set of the decode module according to the instruction control signals decoded by the decode module, and the status register stores the status information of each module in the processor;
the write-back module exchanges data with the external bus and writes the data it reads, together with the computation results of the execution module, back into the register set of the decode module or into the arithmetic execution module of the execution module;
the data path module comprises three data paths with bit widths of 32, 128, and 136 bits; the 32-bit data path transmits 32-bit data among the decode module, the execution module, and the write-back module; the 128-bit data path exchanges data with the outside; the 136-bit data path transmits 136-bit data among the decode module, the execution module, and the write-back module;
the controller controls each module in the processor according to the status information of each module stored in the status register.
As a preferred embodiment of the present invention, the instruction fetch module includes an instruction interface, a prefetch buffer, and a compressed instruction decoder, where the instruction interface connects to the external bus; the prefetch buffer fetches instructions in advance through the instruction interface, reducing instruction-access latency and improving execution efficiency; and the compressed instruction decoder judges whether an acquired instruction is a compressed instruction and decompresses it.
As a preferred embodiment of the present invention, the instructions decoded by the decoder include the RV32IC instruction set and extension instructions; the extension instructions comprise 128-bit load and store instructions, 128-bit arithmetic instructions, activation function instructions, and dynamic fixed-point instructions.
As a preferred embodiment of the present invention, the register set includes a 32-bit general-purpose register set (General Purpose Register, GPR) for storing data processed by the RV32IC instruction set and a 136-bit vector register set (Vector Register) for storing data processed by the extension instructions. The 136-bit vector register set is physically composed of 30 128-bit data registers and 32 8-bit scaling-factor registers; neural network model parameters transmitted over the external bus are stored in the 128-bit data registers, while neural network scaling factors transmitted over the external bus are stored separately in the 8-bit scaling-factor registers. On output, the 136-bit vector register set concatenates a 128-bit group of neural network model parameters with its 8-bit scaling factor to form a 136-bit source operand.
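For illustration, the following C sketch models the register organization just described. It is a minimal model under the assumptions stated in the comments, not the patent's implementation, and every type and function name is invented for this example:

```c
#include <stdint.h>

#define NUM_DATA_REGS 30   /* 30 x 128-bit data registers          */
#define NUM_SF_REGS   32   /* 32 x 8-bit scaling-factor registers  */

/* Hypothetical model of the 136-bit vector register file. */
typedef struct {
    uint8_t data[NUM_DATA_REGS][16];  /* 128-bit neural network parameters */
    int8_t  sf[NUM_SF_REGS];          /* 8-bit scaling factors             */
} vector_regfile_t;

/* A 136-bit source operand: 8-bit scaling factor (bits 135:128)
 * concatenated with 128-bit data (bits 127:0). */
typedef struct {
    int8_t  sf;
    uint8_t data[16];
} operand136_t;

/* Reading register idx splices the scaling factor onto the data word. */
static operand136_t vrf_read(const vector_regfile_t *vrf, int idx) {
    operand136_t op;
    op.sf = vrf->sf[idx];
    for (int i = 0; i < 16; i++)
        op.data[i] = vrf->data[idx][i];
    return op;
}
```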
As a preferred embodiment of the present invention, the neural network quantization scaling factors are determined by the externally input neural network model; when a plain fixed-point operation rather than a dynamic fixed-point operation is performed, the scaling-factor portion is filled with zeros.
As a preferred embodiment of the present invention, the addresses of the 32-bit general-purpose registers and the 136-bit vector registers are derived from the control signals generated by the decoder, and the data written into them comes either from results produced when the execution module executes an instruction or from data written back by the write-back module.
As a preferred embodiment of the present invention, the execution module comprises a status register, a multiplication module, an arithmetic logic unit, a vector arithmetic logic unit, and a lookup table. The multiplication module performs 32-bit multiplication as well as vector fixed-point and dynamic fixed-point multiplication and multiply-add operations; the arithmetic logic unit performs arithmetic and logic operations on 32-bit data; the vector arithmetic logic unit performs vector dynamic fixed-point and fixed-point logic operations as well as nonlinear activation function computation; and the lookup table stores precomputed activation function values and supports the activation function instructions among the extension instructions.
As a preferred embodiment of the present invention, the dynamic fixed-point multiplication and multiply-add operations in the multiplication module and the vector dynamic fixed-point operations in the vector arithmetic logic unit are all dynamic fixed-point computations. The computing resources needed for dynamic fixed-point computation comprise 16 8-bit multipliers for multiplication, 16 8-bit adders for addition and multiply-add operations, and 3 128-bit shifters; the shifters are multiplexed among the multiplication, addition, and multiply-add operations, and the adders and multipliers are also multiplexed for plain fixed-point operations.
The invention also provides a device comprising the above processor.
Compared with the prior art, the invention has the following beneficial effects:
1) For neural network computation, the invention extends convolution operations and activation function computation on the basis of the RISC-V instruction set; the extended instructions accelerate operation and reduce the instruction count, thereby reducing the number of cycles required for computation;
2) To address the precision loss caused by fixed-point operation, an additional dynamic fixed-point instruction is designed that also handles bit-width changes during computation. The processor is equipped with dedicated units for extended parallel computation, activation function computation, and dynamic fixed-point computation; different bit widths are designed for instructions and data, and efficient parallel data streams reduce the total number of data reads and writes while improving parallelism, thereby lowering energy consumption and increasing operation speed.
Drawings
FIG. 1 is a processor pipeline architecture diagram of the present invention;
FIG. 2 is a schematic diagram of a register according to the present invention;
FIG. 3 is a schematic diagram of a dynamic fixed-point computing process in the present invention;
FIG. 4 is a schematic diagram of the structure of the arithmetic unit when implementing dynamic fixed-point calculation in the present invention;
FIG. 5 is a schematic diagram of the structure of the operation unit when nonlinear activation function calculation is implemented in the present invention.
Detailed Description
The invention is further illustrated and described below with reference to specific embodiments. The described embodiments are merely some embodiments of the invention and do not limit its scope. The technical features of the embodiments of the invention can be combined with one another provided they do not conflict.
As shown in FIG. 1, an embodiment of the present invention provides a quantized neural network acceleration processor based on RISC-V extension instructions. The processor has a four-stage pipeline and comprises an instruction fetch module, a decode module, an execution module, a write-back module, and a controller. The instruction fetch module acquires, from an external bus, instructions converted from the external neural network model and judges whether each acquired instruction is a compressed instruction; if not, the instruction is transmitted directly to the decode module, and if so, the compressed instruction is decompressed and then transmitted to the decode module.
The decode module comprises a decoder and a register set; the decoder decodes the instructions transmitted by the instruction fetch module to obtain instruction control signals, and the register set is addressed by the instruction control signals generated by the decoder and stores the write-back data transmitted by the write-back module.
The execution module comprises an arithmetic execution module and a status register; the arithmetic execution module performs specific computations or accesses the register set of the decode module according to the instruction control signals decoded by the decode module, and the status register stores the status information of each module in the processor.
The write-back module exchanges data with the external bus and writes the data it reads, together with the computation results of the execution module, back into the register set of the decode module or into the arithmetic execution module of the execution module.
The data path module comprises three data paths with bit widths of 32, 128, and 136 bits; the 32-bit data path transmits 32-bit data among the decode module, the execution module, and the write-back module; the 128-bit data path exchanges data with the outside; the 136-bit data path transmits 136-bit data among the decode module, the execution module, and the write-back module.
The controller controls each module in the processor according to the status information of each module stored in the status register.
In one embodiment of the invention, the instruction fetch module includes an instruction interface, a prefetch buffer, and a compressed instruction decoder. The prefetch buffer fetches instructions in advance, reducing instruction-access latency and improving execution efficiency; the compressed instruction decoder supports the RV32C instruction set, judging whether an instruction is a compressed instruction and decoding it into a normal instruction before sending it to the decode module. The decode module includes a decoder, a 32-bit general-purpose register set (General Purpose Register, GPR), and a 136-bit vector register set (Vector Register). The addresses of the 32-bit and 136-bit registers are derived from control signals generated by the decoder, and the written data comes either from forwarded results produced by the execution module or from write-back data produced by the write-back module. The execution module includes a control and status register (Control and Status Register, CSR), a multiplication module, an arithmetic logic unit (Arithmetic Logic Unit, ALU), a vector arithmetic logic unit (Vector Arithmetic Logic Unit), and a lookup table (Look Up Table, LUT). The status register stores control and status information of the processor; the multiplication module performs 32-bit multiplication as well as vector fixed-point and dynamic fixed-point multiplication and multiply-add operations; the vector arithmetic logic unit performs vector dynamic fixed-point and fixed-point logic operations as well as nonlinear activation function computation. In the dynamic fixed-point computation path, the input operands include the source operands and the output scaling factor (fl_out) passed down from the decode stage. The write-back module writes the computation results of the execution module and the data read from the data interface back into the registers. The controller controls the operations of the processor according to the control signals supplied by the status register.
The processor supports the 32-bit instruction set and the 128-bit extended vector instruction set. 32-bit data computation uses the 32-bit data path inside the processor. 128-bit fixed-point computation and nonlinear activation function computation use the 136-bit data path, with the upper 8 bits filled with zeros. 128-bit dynamic fixed-point computation uses the 136-bit data path, with the upper 8 bits filled with the scaling factor corresponding to that group of data; the quantization scaling factors are determined by the externally input neural network model. The data path between the data interface and the load-store module is 128 bits wide, so a write from the load-store module to a register concatenates the 128-bit data with zeros in the upper 8 bits to form 136-bit data.
As shown in FIG. 2, the vector register set in the processor is physically composed of 30 128-bit data registers and 32 8-bit scaling-factor registers. When the 128-bit load instruction (lw128) or store instruction (sw128) is executed, the register set is treated as 32 128-bit data registers, and the upper 8 bits of the input are discarded. When the register address is less than 30, the data is stored in the corresponding data register; when the address is 30, the data is stored in scaling-factor registers 0-15; when the address is 31, the data is stored in scaling-factor registers 16-31. When a dynamic fixed-point instruction is executed, the 136-bit input is split into the upper 8-bit scaling factor and the lower 128-bit data, which are stored in the corresponding data register and scaling-factor register according to the input address.
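A hedged C sketch of this register-file write addressing, reusing the hypothetical vector_regfile_t type from the earlier sketch (the function name and signature are likewise assumptions):

```c
/* Register-file write path for the 128-bit memory instructions described
 * above: addresses 0-29 select a 128-bit data register, address 30 fills
 * scaling-factor registers 0-15, and address 31 fills registers 16-31.
 * The upper 8 bits of the 136-bit input have already been discarded. */
static void vrf_write128(vector_regfile_t *vrf, int addr,
                         const uint8_t data128[16]) {
    if (addr < 30) {                     /* ordinary 128-bit data register */
        for (int i = 0; i < 16; i++)
            vrf->data[addr][i] = data128[i];
    } else {
        int base = (addr - 30) * 16;     /* 30 -> sf[0..15], 31 -> sf[16..31] */
        for (int i = 0; i < 16; i++)
            vrf->sf[base + i] = (int8_t)data128[i];
    }
}
```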
As shown in FIG. 3, in one embodiment of the present invention, the processor uses a dynamic fixed-point data representation and dynamic fixed-point addition. A dynamic fixed-point number consists of 8-bit data and an 8-bit scaling factor, with the most significant bit of the data serving as the sign bit. The value actually represented by a dynamic fixed-point number equals the data itself multiplied by 2 raised to the power fl, where fl is the scaling factor. When two numbers with different scaling factors are added, both must be shifted so that their integer parts align with the format determined by the output scaling factor; a fixed-point addition is then performed to obtain the output value. In this design, the 128-bit data consists of 16 parallel 8-bit values that share the same scaling factor.
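A minimal C sketch of this representation and the alignment-then-add rule, assuming value = data × 2^fl as stated above; all names are illustrative and rounding/saturation behavior is simplified:

```c
#include <stdint.h>

/* Re-express x from format fl_in in format fl_out.
 * Since value = x * 2^fl_in = x' * 2^fl_out, x' = x * 2^(fl_in - fl_out). */
static int8_t dfp_align(int8_t x, int8_t fl_in, int8_t fl_out) {
    int d = fl_in - fl_out;
    return (int8_t)((d >= 0) ? (x << d) : (x >> -d));
}

/* Dynamic fixed-point addition: align both operands to the output
 * scaling factor, then add them as plain fixed-point integers. */
static int8_t dfp_add(int8_t a, int8_t fl_a,
                      int8_t b, int8_t fl_b, int8_t fl_out) {
    return (int8_t)(dfp_align(a, fl_a, fl_out) + dfp_align(b, fl_b, fl_out));
}
```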
As shown in FIG. 4, the computing resources used for dynamic fixed-point computation comprise 16 8-bit multipliers for multiplication, 16 8-bit adders for addition and multiply-add operations, and 3 128-bit shifters. For an addition, the scaling factors of source operand a (operand_a) and source operand b (operand_b) are compared with the specified output scaling factor (fl, fractional length) to determine how each operand must be shifted; after shifting, the 16 parallel 8-bit adders produce the output. For a multiplication, operand a and operand b are multiplied by the 16 parallel 8-bit multipliers, bits beyond 8 in each product are discarded, and the multiplier outputs are then shifted according to the operand scaling factors and the specified output scaling factor to obtain the multiplication result. For a multiply-add, operand a and operand b first undergo a dynamic fixed-point multiplication, and the result is added to source operand c (operand_c) by a dynamic fixed-point addition to produce the final output.
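The 16-lane multiply-add can be sketched as below, reusing dfp_align from the previous sketch. The lane width and the truncation of products to 8 bits follow the description above; everything else, including the function name and signature, is an assumption:

```c
/* 16-lane dynamic fixed-point multiply-add (one 128-bit vector operation).
 * Each 8x8 product is truncated to 8 bits, shifted from its natural
 * format fl_a + fl_b to fl_out, then accumulated with operand c. */
static void dfp_mac16(const int8_t a[16], int8_t fl_a,
                      const int8_t b[16], int8_t fl_b,
                      const int8_t c[16], int8_t fl_c,
                      int8_t out[16], int8_t fl_out) {
    for (int lane = 0; lane < 16; lane++) {
        int8_t p = (int8_t)(a[lane] * b[lane]);  /* bits beyond 8 discarded */
        int8_t prod = dfp_align(p, (int8_t)(fl_a + fl_b), fl_out);
        out[lane] = (int8_t)(prod + dfp_align(c[lane], fl_c, fl_out));
    }
}
```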
As shown in FIG. 5, for nonlinear activation function computation the lookup table stores the values of the tanh function at different inputs. During computation, the controller maps the input value to a lookup-table address and fetches the corresponding function value from the lookup-table memory. When the input value is small, the input value itself is selected directly as the output. When the tanh function is computed, the sign bit is attached to the output of the previous step to form the final result; when the sigmoid function is computed, the output of the previous step is offset by 0.5 and the sign bit is attached to obtain the final result.
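A hedged floating-point sketch of this lookup scheme. The table size, input range, small-input threshold, and the sigmoid identity used (sigmoid(x) = tanh(x/2)/2 + 0.5, one way to realize the 0.5 offset described above) are all assumptions for illustration:

```c
#include <math.h>

#define LUT_SIZE 64
static float lut[LUT_SIZE];               /* tanh magnitudes sampled on [0, 4) */

static void lut_init(void) {
    for (int i = 0; i < LUT_SIZE; i++)
        lut[i] = tanhf(4.0f * i / LUT_SIZE);
}

static float lut_tanh(float x) {
    float ax = fabsf(x);
    if (ax < 0.03125f)                     /* small input: tanh(x) ~= x  */
        return x;
    if (ax > 3.9999f) ax = 3.9999f;        /* saturate at the table edge */
    int idx = (int)(ax * LUT_SIZE / 4.0f); /* map input to table address */
    float mag = lut[idx];
    return (x < 0.0f) ? -mag : mag;        /* reattach the sign bit      */
}

static float lut_sigmoid(float x) {
    /* sigmoid(x) = tanh(x/2) / 2 + 0.5 reuses the same tanh table */
    return 0.5f * lut_tanh(0.5f * x) + 0.5f;
}
```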
The invention also provides a device comprising the quantized neural network acceleration processor based on RISC-V extension instructions.
The foregoing examples illustrate only a few embodiments of the invention; they are described in detail but should not be construed as limiting the scope of the invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the invention.

Claims (9)

1. A quantized neural network acceleration processor based on RISC-V extension instructions, comprising: an instruction fetch module, a decode module, an execution module, a write-back module, a data path module, and a controller;
the instruction fetch module acquires, from an external bus, instructions converted from the external neural network model and judges whether each acquired instruction is a compressed instruction; if not, the instruction is transmitted directly to the decode module, and if so, the compressed instruction is decompressed and then transmitted to the decode module;
the decode module comprises a decoder and a register set; the decoder decodes the instructions transmitted by the instruction fetch module to obtain instruction control signals, and the register set is addressed by the instruction control signals generated by the decoder and stores the write-back data transmitted by the write-back module;
the execution module comprises an arithmetic execution module and a status register; the arithmetic execution module performs specific computations or accesses the register set of the decode module according to the instruction control signals decoded by the decode module, and the status register stores the status information of each module in the processor;
the write-back module exchanges data with the external bus and writes the data it reads, together with the computation results of the execution module, back into the register set of the decode module or into the arithmetic execution module of the execution module;
the data path module comprises three data paths with bit widths of 32, 128, and 136 bits; the 32-bit data path transmits 32-bit data among the decode module, the execution module, and the write-back module; the 128-bit data path exchanges data with the outside; the 136-bit data path transmits 136-bit data among the decode module, the execution module, and the write-back module;
the controller controls each module in the processor according to the status information of each module stored in the status register.
2. The quantized neural network acceleration processor of claim 1, wherein the instruction fetch module comprises an instruction interface, a prefetch buffer, and a compressed instruction decoder; the instruction interface connects to the external bus; the prefetch buffer fetches instructions in advance through the instruction interface, reducing instruction-access latency and improving execution efficiency; and the compressed instruction decoder judges whether an acquired instruction is a compressed instruction and decompresses it.
3. The quantized neural network acceleration processor of claim 1, wherein the instructions decoded by the decoder comprise the RV32IC instruction set and extension instructions; the extension instructions comprise 128-bit load and store instructions, 128-bit arithmetic instructions, activation function instructions, and dynamic fixed-point instructions.
4. The quantized neural network acceleration processor of claim 3, wherein the register set comprises a 32-bit general-purpose register set for storing data processed by the RV32IC instruction set and a 136-bit vector register set for storing data processed by the extension instructions; the 136-bit vector register set is physically composed of 30 128-bit data registers and 32 8-bit scaling-factor registers; neural network model parameters transmitted over the external bus are stored in the 128-bit data registers, while neural network scaling factors transmitted over the external bus are stored separately in the 8-bit scaling-factor registers; on output, the 136-bit vector register set concatenates a 128-bit group of neural network model parameters with its 8-bit scaling factor to form a 136-bit source operand.
5. The quantized neural network acceleration processor of claim 4, wherein the neural network quantization scaling factors are determined by the externally input neural network model, and the scaling-factor portion is filled with zeros when a plain fixed-point operation rather than a dynamic fixed-point operation is performed.
6. The quantized neural network acceleration processor of claim 4, wherein the addresses of the 32-bit general-purpose registers and the 136-bit vector registers are derived from the instruction control signals generated by the decoder, and the data written into the 32-bit general-purpose registers and the 136-bit vector registers comes either from results produced when the execution module executes an instruction or from data written back by the write-back module.
7. The quantized neural network acceleration processor of claim 1, wherein the execution module comprises a status register, a multiplication module, an arithmetic logic unit, a vector arithmetic logic unit, and a lookup table; the multiplication module performs 32-bit multiplication as well as vector fixed-point and dynamic fixed-point multiplication and multiply-add operations; the arithmetic logic unit performs arithmetic and logic operations on 32-bit data; the vector arithmetic logic unit performs vector dynamic fixed-point and fixed-point logic operations as well as nonlinear activation function computation; and the lookup table stores precomputed activation function values and supports the activation function instructions among the extension instructions.
8. The quantized neural network acceleration processor of claim 7, wherein the dynamic fixed-point multiplication and multiply-add operations in the multiplication module and the vector dynamic fixed-point operations in the vector arithmetic logic unit are all dynamic fixed-point computations; the computing resources needed for dynamic fixed-point computation comprise 16 8-bit multipliers for multiplication, 16 8-bit adders for addition and multiply-add operations, and 3 128-bit shifters; the shifters are multiplexed among the multiplication, addition, and multiply-add operations, and the adders and multipliers are also multiplexed for plain fixed-point operations.
9. An apparatus comprising the processor of any of claims 1-8.
CN202311581806.XA 2023-11-24 Quantized neural network acceleration processor based on RISC-V expansion instruction Active CN117634569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311581806.XA CN117634569B (en) 2023-11-24 Quantized neural network acceleration processor based on RISC-V expansion instruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311581806.XA CN117634569B (en) 2023-11-24 Quantized neural network acceleration processor based on RISC-V expansion instruction

Publications (2)

Publication Number Publication Date
CN117634569A 2024-03-01
CN117634569B (en) 2024-06-28


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118092853A (en) * 2024-04-26 2024-05-28 中科亿海微电子科技(苏州)有限公司 Instruction set expansion method and device based on RISC-V floating point overrunning function


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110007961A (en) * 2019-02-01 2019-07-12 中山大学 A kind of edge calculations hardware structure based on RISC-V
CN111915003A (en) * 2019-05-09 2020-11-10 深圳大普微电子科技有限公司 Neural network hardware accelerator
US20230074229A1 (en) * 2020-02-05 2023-03-09 The Trustees Of Princeton University Scalable array architecture for in-memory computing
WO2023016481A1 (en) * 2021-08-13 2023-02-16 华为技术有限公司 Data processing method and related apparatus
CN114239806A (en) * 2021-12-16 2022-03-25 浙江大学 RISC-V structured multi-core neural network processor chip
CN114911526A (en) * 2022-06-01 2022-08-16 中国人民解放军国防科技大学 Brain-like processor based on brain-like instruction set and application method thereof
CN116258185A (en) * 2023-01-11 2023-06-13 阿里巴巴(中国)有限公司 Processor, convolution network computing method with variable precision and computing equipment
CN116432765A (en) * 2023-01-16 2023-07-14 浙江大学 RISC-V-based special processor for post quantum cryptography algorithm
CN115983348A (en) * 2023-02-08 2023-04-18 天津大学 RISC-V accelerator system supporting convolution neural network extended instruction
CN116700796A (en) * 2023-05-29 2023-09-05 中国人民解放军93216部队 Implementation architecture and method of RISC-V information security expansion instruction on five-stage pipeline structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
傅思扬; 陈华; 郁发新: "Design and Implementation of a Convolutional Neural Network Processor Based on RISC-V", 微电子学与计算机 (Microelectronics & Computer), no. 04, 5 April 2020 (2020-04-05) *
李理: "Research on the Development of Edge Computing Technology in 2019", 无人系统技术 (Unmanned Systems Technology), no. 02, 15 March 2020 (2020-03-15) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant