WO2022134729A1 - Artificial intelligence inference method and system based on RISC-V - Google Patents

Artificial intelligence inference method and system based on RISC-V

Info

Publication number
WO2022134729A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
instruction
data
register
memory
Prior art date
Application number
PCT/CN2021/122287
Other languages
English (en)
French (fr)
Inventor
贾兆荣
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 filed Critical 苏州浪潮智能科技有限公司
Priority to US18/246,662 priority Critical patent/US11880684B2/en
Publication of WO2022134729A1 publication Critical patent/WO2022134729A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30038Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • G06F9/30109Register structure having multiple operands in a single register
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • G06F9/30112Register structure comprising data of variable length
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • G06F9/30116Shadow registers, e.g. coupled registers, not forming part of the register space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181Instruction operation extension or modification
    • G06F9/30189Instruction operation extension or modification according to execution mode, e.g. mode flag
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute

Definitions

  • This application relates to the field of artificial intelligence, and more specifically, to a RISC-V-based artificial intelligence reasoning method and system.
  • AI (Artificial Intelligence) chips are roughly classified as follows. From the perspective of application scenarios, AI chips have two main directions: the cloud, deployed in data centers, and the terminal, deployed in consumer devices. From a functional point of view, AI chips mainly do two things: Training and Inference. At present, the large-scale applications of AI chips are in the cloud and in terminals, and AI chips in the cloud do both Training and Inference.
  • Training uses a large amount of labeled data to "train" a system so that it can perform a specific function: for example, the system is given a large number of "cat" pictures and told that each one is a "cat", after which the system "knows" what a cat is. Inference uses the trained system to complete a task: following the above example, a picture is given to the trained system, which then concludes whether or not the picture is a cat.
  • AI chips in the cloud are currently mainly GPUs (graphics processing units). Because training requires large amounts of data, computing power, and power consumption, and therefore large-scale heat dissipation, training is concentrated in the cloud. Inference is also currently concentrated mainly in the cloud, but with the efforts of more and more manufacturers, many applications are gradually moving to the terminal, such as the autonomous driving chips now in wide use.
  • Completing inference at the terminal is mainly to meet the low latency requirements of the terminal.
  • the latency of cloud inference depends on the network and is generally large, making it difficult to meet the needs of the terminal (such as autonomous driving). Performing inference at the terminal can also meet the diverse needs of the terminal, and allows preliminary screening of terminal data so that only valid data is transferred to the cloud.
  • RISC-V is an open source instruction set architecture (ISA) based on reduced instruction set computer (RISC) principles. Unlike most instruction sets, the RISC-V instruction set can be freely used for any purpose, allowing anyone to design, manufacture, and sell RISC-V chips and software without paying royalties to any company. While it is not the first open source instruction set, it is significant because its design makes it suitable for modern computing devices (such as warehouse-scale cloud computers, high-end mobile phones, and tiny embedded systems). Its designers considered performance and power efficiency in these applications, and the instruction set also has extensive software support, which addresses the usual weakness of a new instruction set.
  • the design of the RISC-V instruction set takes into account the realities of small size, high speed, and low power consumption, but does not over-design for a specific microarchitecture. Because the instruction set sits between hardware and software, it is the main communication bridge of the computer. Therefore, a well-designed instruction set that is open source and usable by anyone allows more resources to be reused and reduces software costs. Such an instruction set also increases competitiveness in the hardware supplier market, because suppliers can divert more resources to design and spend less on software support. However, the RISC-V instruction set lacks the processor hardware design and software support needed for the inference computation of AI chips.
  • the present application discloses a RISC-V-based artificial intelligence inference method, comprising performing the following steps:
  • responding to the instruction being a vector instruction means determining that the instruction of artificial intelligence inference includes a vector instruction and responding to the vector instruction.
  • memory includes vector data memory, instruction memory, and scalar data memory; registers include vector registers and scalar registers.
  • the step of loading data from the memory to the corresponding register based on the instruction includes: determining the number of vectors in a single operation based on an environment parameter, and loading the vector data for a single operation into the vector register according to a vector load instruction in the instruction.
  • the RISC-V-based artificial intelligence inference method further includes the following steps: determining, by the convolution control unit, an environment parameter based on a register configuration instruction in the instruction, where the environment parameter includes the effective bit width of the vector, the number of vector registers in each group, the register bit width, and the number of vectors currently required to be operated;
  • the step of determining the number of vectors in a single operation based on the environment parameter includes: determining the maximum allowed number of operation vectors according to the register bit width, the effective bit width of the vector, and the number of vector registers in each group, and taking the smaller of the maximum allowed number and the number of vectors currently required to be operated as the number of vectors in a single operation.
  • the step of processing the corresponding vector data in the vector processing unit by the convolution control unit based on the vector instruction includes:
  • in response to the vector register having data, the shadow register of the vector processing unit being empty, and the convolution control unit allowing it, buffering the vector data from the vector register to the shadow register;
  • the vector data is sequentially reordered and preprocessed in the shadow register, and stored in the multiplier input buffer of the vector processing unit; and
  • the vector data is obtained from the multiply-accumulator by the vector activation unit of the vector processing unit to perform nonlinear vector operations using a look-up table under the control of the convolution control unit.
  • the artificial intelligence inference method based on RISC-V further comprises the steps of: configuring the cache area of the lookup table for the vector activation unit by the convolution control unit based on the lookup table activation instruction in the instruction; and
  • a product operation, an accumulation operation, or a nonlinear vector operation is selectively performed on the vector data by the convolution control unit based on real-time control instructions in the instructions.
  • the artificial intelligence reasoning method based on RISC-V further comprises the following steps:
  • responding to the instruction being a scalar instruction refers to determining that the instruction for artificial intelligence inference also includes a scalar instruction, and responding to the scalar instruction.
  • a RISC-V-based artificial intelligence inference system includes a processor and a memory; the memory stores computer-readable instructions executable by the processor, and the computer-readable instructions, when executed by the processor, cause the processor to perform the following steps:
  • responding to the instruction being a vector instruction means determining that the instruction of artificial intelligence inference includes a vector instruction and responding to the vector instruction.
  • the memory includes vector data memory, instruction memory, and scalar data memory;
  • the registers include vector registers and scalar registers;
  • the step of loading data from the memory into the corresponding register based on the instruction includes: determining the number of vectors in a single operation based on the environment parameters, and loading the vector data for a single operation into the vector register according to the vector load instruction in the instruction;
  • the processor also performs the following steps when executing the computer-readable instructions: determining, by the convolution control unit, the environment parameters based on a register configuration instruction in the instruction, where the environment parameters include the effective bit width of the vector, the number of vector registers in each group, the register bit width, and the number of vectors currently required to be operated; and the step of determining the number of vectors in a single operation based on the environment parameters includes: determining the maximum allowed number of operation vectors according to the register bit width, the effective bit width of the vector, and the number of vector registers in each group, and taking the smaller of the maximum allowed number and the number of vectors currently required to be operated as the number of vectors in a single operation.
  • the step of processing the corresponding vector data in the vector processing unit by the convolution control unit based on the vector instruction includes: in response to the vector register having data, the shadow register of the vector processing unit being empty, and the convolution control unit allowing it, caching the vector data from the vector register to the shadow register; sequentially reordering and preprocessing the vector data in the shadow register, and storing it in the multiplier input buffer of the vector processing unit; obtaining, by the multiplier array of the vector processing unit, the vector data from the input buffer to perform a product operation under the control of the convolution control unit; obtaining, by the multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform the accumulation operation under the control of the convolution control unit; and obtaining, by the vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform nonlinear vector operations using a look-up table under the control of the convolution control unit.
  • one or more non-volatile computer-readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • responding to the instruction being a vector instruction means determining that the instruction of artificial intelligence inference includes a vector instruction and responding to the vector instruction.
  • FIG. 1 is a schematic flowchart of an artificial intelligence inference method based on RISC-V in some embodiments of the present application;
  • FIG. 2 is a schematic block diagram of an artificial intelligence inference method based on RISC-V in some embodiments of the present application;
  • FIG. 3 is a flowchart of instruction acquisition of an artificial intelligence inference method based on RISC-V in some embodiments of the present application;
  • FIG. 4 is a structural diagram of a vector processing unit of an artificial intelligence inference method based on RISC-V in some embodiments of the present application;
  • FIG. 5 is a flowchart of vector reordering of an artificial intelligence inference method based on RISC-V in some embodiments of the present application;
  • FIG. 6 is an overall flowchart of a vector processing unit of an artificial intelligence inference method based on RISC-V in some embodiments of the present application;
  • FIG. 7 is a schematic diagram of a convolution control unit of an artificial intelligence inference method based on RISC-V in some embodiments of the present application; and
  • FIG. 8 is a schematic diagram of a processor on-chip interconnection structure of an artificial intelligence inference method based on RISC-V in some embodiments of the present application.
  • FIG. 1 shows a schematic flowchart of an artificial intelligence reasoning method based on RISC-V in this embodiment.
  • the artificial intelligence reasoning method based on RISC-V includes the following steps:
  • Step S101: obtain the instructions and data of artificial intelligence inference through the direct memory access interface and write them into the memory;
  • Step S103: obtain and decode the instruction from the memory, and load the data from the memory into the corresponding register based on the instruction;
  • Step S105: in response to the instruction being a vector instruction, process, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction; and
  • Step S107: feed back the processed vector data to complete the inference.
  • the instruction in step S103 refers to an instruction of artificial intelligence inference.
  • Responding to the instruction being a vector instruction refers to determining that the instruction for artificial intelligence reasoning includes a vector instruction, and responding to the vector instruction.
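  • The flow of steps S101 to S107 can be sketched in Python. This is purely illustrative: the opcode naming convention and the path labels are assumptions, not from the patent, which only fixes the sequence fetch, decode, vector-or-scalar dispatch, then load/execute/write-back.

```python
# Toy model of the S101-S107 flow: instructions are fetched from
# instruction memory and decoded, then a vector instruction is routed
# to the convolution-control/vector-processing path while a scalar
# instruction goes to the scalar ALU path.
def dispatch(instruction):
    opcode = instruction[0]
    # Assumption: vector opcodes carry a leading "v", loosely mirroring
    # RISC-V vector-extension mnemonics.
    return "vector_path" if opcode.startswith("v") else "scalar_path"

program = [("vle8", "v1"), ("add", "x1", "x2"), ("vmacc", "v2")]
paths = [dispatch(ins) for ins in program]
# paths == ["vector_path", "scalar_path", "vector_path"]
```

In hardware this decision is made after pre-decoding, so vector and scalar instructions can proceed down independent load/execute/write-back pipelines.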
  • the present application discloses an AI chip architecture based on the RISC-V instruction set, which can complete convolution or matrix calculations on its own, and can also be used as an AI inference accelerator to assist the processor in completing convolution/matrix calculations. Since it is fully compatible with the RISC-V reduced instruction set, the present application can be developed further on the existing RISC-V software tool chain, greatly reducing the development difficulty of the software tool chain.
  • the core of the design of this application is the convolution operation architecture based on the RISC-V instruction set, which can complete scalar operations, vector operations, convolution operations, matrix operations, nonlinear activation operations, etc., and can meet all computing needs of artificial intelligence reasoning.
  • AI chips can also be interconnected through the on-chip mesh (Mesh) interconnection network (NoC) to form a larger computing power architecture to meet different terminal computing power requirements.
  • the memory includes vector data memory, instruction memory, and scalar data memory; the registers include vector registers and scalar registers.
  • the step of loading data from the memory into the corresponding register based on the instruction includes: determining the number of vectors in a single operation based on the environment parameters, and loading the vector data for a single operation into the vector register according to the vector load instruction in the instruction.
  • the artificial intelligence inference method based on RISC-V further includes the following steps: determining, by the convolution control unit, the environment parameters based on a register configuration instruction in the instruction, where the environment parameters include the effective bit width of the vector, the number of vector registers in each group, the register bit width, and the number of vectors currently required to be operated;
  • the step of determining the number of vectors in a single operation based on the environment parameters includes: determining the maximum allowed number of operation vectors as the register bit width divided by the effective bit width of the vector, multiplied by the number of vector registers in each group, and taking the smaller of the maximum allowed number and the number of vectors currently required to be operated as the number of vectors in a single operation.
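  • This rule resembles the vector-length computation of the RISC-V vector extension (`vsetvl`-style behavior). A minimal sketch, with illustrative function and parameter names not taken from the patent:

```python
def single_op_vector_count(reg_bit_width, vec_eff_bit_width,
                           regs_per_group, vecs_required):
    # Maximum allowed = (register bit width / effective vector bit width)
    #                   * number of vector registers per group.
    max_allowed = (reg_bit_width // vec_eff_bit_width) * regs_per_group
    # The single-operation count is the smaller of the maximum allowed
    # and the number of vectors currently required to be operated on.
    return min(max_allowed, vecs_required)

# 128-bit registers, 8-bit elements, 4 registers per group -> at most 64
# elements per operation; a request for 100 therefore yields 64 now,
# with the remaining 36 handled in a subsequent operation.
```

Splitting long requests this way lets the same program run unchanged on hardware with different register widths.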
  • the step of processing the corresponding vector data in the vector processing unit by the convolution control unit based on the vector instruction includes:
  • in response to the vector register having data, the shadow register of the vector processing unit being empty, and the convolution control unit allowing it, buffering the vector data from the vector register to the shadow register;
  • the vector data is sequentially reordered and preprocessed in the shadow register, and stored in the multiplier input buffer of the vector processing unit; and
  • the vector data is obtained from the multiply-accumulator by the vector activation unit of the vector processing unit to perform nonlinear vector operations using a look-up table under the control of the convolution control unit.
  • the method further includes: configuring, by the convolution control unit, a buffer area of the lookup table for the vector activation unit based on the lookup table activation instruction in the instruction;
  • the vector processing unit is caused to selectively perform a product operation, an accumulation operation, or a nonlinear vector operation by the convolution control unit based on real-time control instructions in the instructions.
  • the artificial intelligence reasoning method based on RISC-V further includes performing the following steps:
  • responding to the instruction being a scalar instruction refers to determining that the instruction for artificial intelligence inference also includes a scalar instruction, and responding to the scalar instruction.
  • The top-level schematic diagram of the architecture of the present application is shown in FIG. 2.
  • the present application includes a DMA (Direct Memory Access) interface, a scalar data storage unit, an instruction storage unit, a vector storage unit, instruction fetch, instruction decode, and instruction prediction units, 32 scalar registers, 32 vector registers, control registers, a scalar arithmetic logic unit, a vector processing unit, a convolution control unit, and a multiplier array.
  • the DMA interface is responsible for loading the instructions and data in DDR (Double Data Rate SDRAM) into the corresponding storage units. The scalar/vector/instruction storage units are all tightly coupled memories; compared with a cache, tightly coupled memory has low power consumption, fixed latency, and no cache misses, which meets the real-time and reliability requirements of the processor.
  • the instruction fetch/decode/predict unit is a unit in which the processor reads instructions from the instruction storage unit, decodes instructions, and predicts branches.
  • Referring to the instruction acquisition flowchart of the RISC-V-based artificial intelligence inference method shown in FIG. 3, after the instruction fetch, the processor determines whether the fetched instruction is a vector instruction or a scalar instruction, and then performs the vector/scalar data load, execute, and write-back operations.
  • the instruction fetch/decode/prediction unit also has an instruction prediction function. After each instruction is fetched, the address of the next instruction is generated; if the pre-decoding unit determines that the instruction is a branch instruction, the processor recalculates the next-instruction address and fetches the alternative in advance.
  • Vector registers/scalar registers are registers specified in the RISC-V architecture.
  • the scalar registers are 32 32-bit registers, and the vector registers are 32 registers of customizable bit width.
  • the functions of scalar registers include caching function call return addresses, heap pointers, temporary variables, function parameters or return values, etc.
  • Vector registers are used to cache vector data variables, mask data, intermediate calculation results, etc.
  • the scalar arithmetic logic unit (or arithmetic logic unit) completes scalar arithmetic/logic operations;
  • the vector processing unit mainly completes vector operations other than convolution/product, such as matrix transposition, reshaping, nonlinear operations, and vector accumulation;
  • the convolution control unit is responsible for vector instruction decoding, module register configuration, nonlinear function lookup table cache, vector logic control, etc.;
  • the multiplier array unit mainly completes the functions of convolution and matrix multiplication, and internally integrates 8 multiplier modules (the number can be defined as other quantities), each of which integrates 64 8-bit multipliers (the number of multipliers can be customized according to the architecture).
  • the structure of the vector processing unit is shown in Figure 4.
  • the bit width N of the vector register is the length of the register determined during hardware design, for example, the length of N can be 128, 512, or 1024.
  • the data processing flow involved in the vector processing unit is as follows:
  • the vector data is cached in shadow registers (or vector shadow registers).
  • the shadow register is used to cache the data in the vector register and is controlled by the vector load instruction, the convolution control, and its own state. When there is new data in the vector register, convolution is allowed to run, and the shadow register is empty, the data in the vector register is loaded into the shadow register.
  • the vector in the shadow register is reordered in the vector data reordering unit according to the configuration of the convolution control unit: the two-dimensional matrix is converted into a three-dimensional matrix in the manner shown in FIG. 5, in accordance with the principle of matrix multiplication, to increase the reuse rate of matrix data.
  • the vector data reordering unit also has functions such as matrix transposition, and is responsible for arranging the data so that it can be consumed directly by the multiplier.
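  • A rough illustration of the 2-D-to-3-D regrouping idea (the actual FIG. 5 layout is not reproduced here; the function name and block shape are assumptions):

```python
def reorder_rows_to_blocks(matrix, rows_per_block):
    # Regroup a 2-D matrix (list of rows) into a 3-D structure of row
    # blocks, so that each block can be streamed to the multiplier
    # array once and reused across many column multiplications.
    assert len(matrix) % rows_per_block == 0
    return [matrix[i:i + rows_per_block]
            for i in range(0, len(matrix), rows_per_block)]

m = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
blocks = reorder_rows_to_blocks(m, 2)
# blocks[0] == [[0, 1, 2], [3, 4, 5]]
```

The same unit would apply transposition here when the configuration registers request it, so downstream hardware never has to rearrange operands itself.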
  • the vector preprocessing unit is responsible for scaling, offsetting, and clipping (truncating) the vector.
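  • Element-wise, the preprocessing step can be sketched as follows. The parameter names and the scale-then-offset-then-clip ordering are assumptions; the patent only names the three operations:

```python
def preprocess(vec, scale, offset, lo, hi):
    # Scale and offset each element, then clip (truncate) it to the
    # representable range before it enters the multiplier input buffer.
    return [min(hi, max(lo, v * scale + offset)) for v in vec]

preprocess([10, 20, 300], 1, 5, 0, 255)   # -> [15, 25, 255]
```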
  • the processed data is entered into the multiplier input buffer.
  • the purpose of this buffering relates to the matrix multiplication illustrated in the figure: each row of data of the first matrix is multiplied by all the column data of the second matrix to obtain one row of data of the output matrix, so operands are reused many times. An input buffer unit is therefore set to buffer this multiplexed data. This mechanism effectively reduces the number of accesses to the DDR by the processor and reduces the processor's demand for DDR bandwidth.
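  • The reuse pattern can be sketched in plain Python (illustrative only; real hardware streams this through the multiplier array and multiply-accumulator rather than looping in software):

```python
def matmul_with_row_buffer(A, B):
    # Each row of the first matrix is fetched into an input buffer once
    # and multiplied against every column of the second matrix to
    # produce one output row, so the row never has to be re-read from
    # main memory (DDR) once per column.
    n, m, k = len(A), len(B), len(B[0])
    out = [[0] * k for _ in range(n)]
    for i in range(n):
        row_buffer = A[i]                       # one DDR fetch per row
        for j in range(k):                      # reused for every column
            out[i][j] = sum(row_buffer[t] * B[t][j] for t in range(m))
    return out

matmul_with_row_buffer([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# -> [[19, 22], [43, 50]]
```

Without the buffer, each of the k column multiplications would re-fetch the same row, multiplying memory traffic by the output width.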
  • the buffered data enters the multiplier array to do the product operation to obtain the product data.
  • the multiplier array is set according to the maximum value of the data that can be loaded by the vector register, which can meet the peak demand of data throughput without redundancy, and make full use of hardware resources without waste.
  • the product data is input to the multiply-accumulator for accumulating operation.
  • the vector activation unit is responsible for nonlinear vector operations, using the look-up table method.
  • the lookup table is configured by loading its parameters into the buffer area of the vector activation unit through a data load instruction.
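  • A minimal sketch of look-up-table activation, assuming a sigmoid function, a 256-entry table, and an input range of roughly ±8 (none of these parameters are specified in the patent):

```python
import math

def build_sigmoid_lut(entries=256, lo=-8.0, hi=8.0):
    # Precompute the nonlinear function into a table; in the patent the
    # table is loaded into the vector activation unit's buffer by a
    # data load instruction.
    step = (hi - lo) / (entries - 1)
    lut = [1.0 / (1.0 + math.exp(-(lo + i * step))) for i in range(entries)]
    return lut, lo, step

def lut_activate(x, lut, lo, step):
    # Map each input to the nearest table entry instead of evaluating
    # the function exactly; out-of-range inputs are clamped.
    idx = max(0, min(len(lut) - 1, round((x - lo) / step)))
    return lut[idx]

lut, lo, step = build_sigmoid_lut()
y = lut_activate(0.0, lut, lo, step)   # close to sigmoid(0) = 0.5
```

Table lookup trades a small quantization error for constant-time evaluation of any activation function, which is why it suits a fixed-function hardware unit.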
  • the structure of the convolution control unit is shown in Figure 7.
  • the convolution control unit controls the entire convolution/matrix multiplication process according to vector instructions.
  • the vector instruction sent by the instruction fetch and decoding unit is decoded in the convolution control unit.
  • the configuration registers are mapped into the processor's memory space. When a vector instruction addresses this address segment, it is treated as a register-configuration command, and the registers are distinguished by address. The configuration registers include registers of the different unit modules, such as the data size, data input dimension, data output dimension, and transpose-or-not registers of the vector data reordering unit, and the scaling, offset, and truncation coefficient registers of the vector preprocessing unit.
  • the processor of the present application can interconnect multiple processors through an on-chip mesh interconnection network (NoC) to form an architecture with greater computing power, such as an artificial intelligence processor (AIPU).
  • as shown in Figure 8, the number of processors can be large or small, which is very flexible and suits applications of different hardware scales.
  • the processor architecture interconnected by the network supports single-instruction-multiple-data (SIMD) or multiple-instruction-multiple-data (MIMD) modes, making software programming more flexible.
  • the RISC-V-based artificial intelligence inference method acquires the instructions and data of artificial intelligence inference through the direct memory access interface and writes them into the memory; fetches and decodes the instructions from the memory, and loads the data from the memory into the corresponding registers based on the instructions; in response to the instruction being a vector instruction (that is, it determines that the instructions of artificial intelligence inference include a vector instruction and responds to that vector instruction), the convolution control unit processes the corresponding vector data in the vector processing unit based on the vector instruction; and the processed vector data is fed back to complete inference. This technical solution can apply the RISC-V instruction set to the inference computation of AI chips, facilitating the application and deployment of artificial intelligence inference.
  • the steps in the embodiments of the above RISC-V-based artificial intelligence inference method can be interleaved, replaced, added, and deleted.
  • these reasonable permutations and combinations of the artificial intelligence inference method should also belong to the protection scope of the present application, and the protection scope of the present application should not be limited to the foregoing embodiments.
  • the present application discloses a RISC-V-based artificial intelligence inference system that facilitates the application and deployment of artificial intelligence inference.
  • the system includes a processor and a memory, and the memory stores computer-readable instructions executable by the processor which, when executed by the processor, cause the processor to perform the following steps:
  • responding to the instruction being a vector instruction means determining that the instructions of artificial intelligence inference include a vector instruction and responding to that vector instruction.
  • the memory includes vector data memory, instruction memory, and scalar data memory;
  • the registers include vector registers and scalar registers;
  • the computer-readable instructions, when executed by the processor, implement the step of loading the data from the memory into the corresponding registers based on the instructions;
  • the step includes: determining the number of single vector operations based on environment parameters, and loading the vector data of that number into the vector register according to the vector load instruction among the instructions.
  • the computer-readable instructions further implement the following steps when executed by the processor: the convolution control unit determines environment parameters based on the register configuration instruction among the instructions, the environment parameters including the effective bit width of the vector, the number of vector registers per group, the register bit width, and the number of vectors currently to be operated on; the step of determining the number of single vector operations based on the environment parameters includes: dividing the register bit width by the effective vector bit width and multiplying by the number of vector registers per group to determine the maximum allowed number of operation vectors, and determining the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number of single vector operations.
  • the computer-readable instructions, when executed by the processor, implement the step of the convolution control unit processing the corresponding vector data in the vector processing unit based on the vector instructions, comprising: in response to the vector register having data, the shadow register of the vector processing unit being empty, and the convolution control unit permitting it, caching the vector data from the vector register into the shadow register; reordering and then preprocessing the vector data in the shadow register, and storing it into the multiplier input buffer of the vector processing unit; obtaining, by the multiplier array of the vector processing unit, the vector data from the multiplier input buffer to perform the product operation under the control of the convolution control unit; obtaining, by the multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform the accumulation operation under the control of the convolution control unit; and obtaining, by the vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform a nonlinear vector operation using a lookup table under the control of the convolution control unit.
  • the computer-readable instructions when executed by the processor, further implement the following steps:
  • a product operation, an accumulation operation, or a nonlinear vector operation is selectively performed on the vector data by the convolution control unit based on real-time control instructions in the instructions.
  • the computer-readable instructions when executed by the processor, further implement the following steps:
  • responding to the instruction being a vector instruction means determining that the instructions of artificial intelligence inference include a vector instruction and responding to that vector instruction.
  • responding to the instruction being a scalar instruction means determining that the instructions for artificial intelligence inference further include a scalar instruction and responding to that scalar instruction.
  • the system provided by the embodiments of the present application acquires the instructions and data of artificial intelligence inference through the direct memory access interface and writes them into the memory; fetches and decodes the instructions from the memory, and loads the data from the memory into the corresponding registers based on the instructions;
  • in response to the instruction being a vector instruction, the convolution control unit processes the corresponding vector data in the vector processing unit based on the vector instruction; and the processed vector data is fed back to complete inference. This technical scheme can apply the RISC-V instruction set to the inference computation of AI chips, facilitating the application and deployment of artificial intelligence inference.
  • one or more non-volatile computer-readable storage media that store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to Perform the following steps:
  • responding to the instruction being a vector instruction means determining that the instructions of artificial intelligence inference include a vector instruction and responding to that vector instruction.
  • the step of loading the data from the memory into a corresponding register based on the instructions includes: determining the number of single vector operations based on environment parameters, and loading the vector data of that number into the vector register according to the vector load instruction among the instructions.
  • the convolution control unit determines environment parameters based on a register configuration instruction among the instructions, where the environment parameters include the effective bit width of the vector, the number of vector registers per group, the register bit width, and the number of vectors currently to be operated on;
  • the step of determining the number of single vector operations based on the environment parameters includes: dividing the register bit width by the effective vector bit width and multiplying by the number of vector registers per group to determine the maximum allowed number of operation vectors, and determining the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number of single vector operations.
  • the step of processing, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instructions includes:
  • in response to the vector register having data, the shadow register of the vector processing unit being empty, and the convolution control unit permitting it, the vector data is buffered from the vector register to the shadow register;
  • the vector data is sequentially reordered and preprocessed in the shadow register, and stored in the multiplier input buffer of the vector processing unit;
  • the vector data is obtained from the multiply-accumulator by the vector activation unit of the vector processing unit to perform nonlinear vector operations using a look-up table under the control of the convolution control unit.
  • the computer-readable instructions when executed by the processor, further implement the following steps:
  • a product operation, an accumulation operation, or a nonlinear vector operation is selectively performed on the vector data by the convolution control unit based on real-time control instructions in the instructions.
  • the computer-readable instructions when executed by the processor, further implement the following steps:
  • responding to the instruction being a scalar instruction means determining that the instructions for artificial intelligence inference further include a scalar instruction and responding to that scalar instruction.
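The shadow-register load handshake described in the bullets above (data moves from the vector register into the shadow register only when the vector register holds new data, the convolution control unit permits it, and the shadow register is empty) can be sketched as a simple predicate. The function name and boolean interface below are our illustration, not the patent's signal names:

```python
def shadow_load_allowed(vreg_has_new_data: bool,
                        conv_ctrl_allows: bool,
                        shadow_empty: bool) -> bool:
    """Model of the shadow-register load condition: all three signals
    must be asserted before the vector register's contents may be
    cached into the shadow register. Illustrative sketch, not RTL."""
    return vreg_has_new_data and conv_ctrl_allows and shadow_empty
```

If any one condition fails, the load stalls until all three hold.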

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

A RISC-V-based artificial intelligence inference method and system. The RISC-V-based artificial intelligence inference method includes the following steps: acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into memory (S101); fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions (S103); in response to an instruction being a vector instruction, processing, by a convolution control unit, the corresponding vector data in a vector processing unit based on the vector instruction (S105); and feeding back the processed vector data to complete inference (S107).

Description

A RISC-V-based artificial intelligence inference method and system
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on December 24, 2020, with application number 202011554149.6 and entitled "A RISC-V-based artificial intelligence inference method and system", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and more specifically to a RISC-V-based artificial intelligence inference method and system.
Background
At present, AI (Artificial Intelligence) chips can be roughly classified as follows. From the perspective of application scenarios, AI chips go in two main directions: the cloud, deployed in data centers, and the terminal, deployed in consumer devices. From the perspective of function, AI chips mainly do two things: training and inference. Large-scale application of AI chips currently takes place in both the cloud and the terminal. Cloud AI chips do both training and inference. Training means "training" a system with a large amount of labeled data so that it can fit a specific function; for example, feeding the system a massive number of pictures of "cats" and telling it that these are "cats", after which the system "knows" what a cat is. Inference means using the trained system to complete a task; continuing the example, you give a picture to the previously trained system and let it conclude whether or not the picture shows a cat.
Cloud AI chips are currently mainly GPUs (graphics processing units). Because training requires large amounts of data, great computing power, and high power consumption, and needs large-scale heat dissipation, training will remain concentrated in the cloud for a long time. Inference is also currently completed mainly in the cloud, but with the efforts of more and more vendors, many applications will gradually move to the terminal, such as the autonomous-driving chips that are widely applied today.
Completing inference at the terminal mainly serves to meet the terminal's low-latency requirements (the latency of cloud inference depends on the network and is generally large, making it hard to satisfy terminals such as autonomous driving), to meet the diverse requirements of terminals, and to pre-screen terminal data so that only valid data is transmitted to the cloud, among other purposes.
RISC-V is an open-source instruction set architecture (ISA) based on reduced instruction set computing (RISC) principles. Compared with most instruction sets, the RISC-V instruction set can be freely used for any purpose, and anyone is allowed to design, manufacture, and sell RISC-V chips and software without paying royalties to any company. Although it is not the first open-source instruction set, it is significant because its design makes it suitable for modern computing devices (such as warehouse-scale cloud computers, high-end mobile phones, and tiny embedded systems). Its designers took the performance and power efficiency of these uses into account. The instruction set also has extensive supporting software, which addresses the usual weakness of a new instruction set.
The RISC-V instruction set was designed with small, fast, low-power real-world implementations in mind, but without over-designing for any particular microarchitecture. Because the instruction set sits between hardware and software, it is the computer's main communication bridge, so a well-designed instruction set that is open source and usable by anyone allows more resources to be reused, greatly reducing software costs. Such an instruction set also increases competition in the hardware vendor market, because vendors can devote more resources to design and spend fewer on supporting software. However, the RISC-V instruction set lacks the processor hardware design and software support needed for it to be used for inference computation on AI chips.
For the problem in the prior art that the RISC-V instruction set lacks processor hardware design and software support and cannot be used for inference computation on AI chips, there is currently no effective solution.
Summary
In some embodiments, the present application discloses a RISC-V-based artificial intelligence inference method, including the following steps:
acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into memory;
fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
in response to an instruction being a vector instruction, processing, by a convolution control unit, the corresponding vector data in a vector processing unit based on the vector instruction; and
feeding back the processed vector data to complete inference.
Here, responding to an instruction being a vector instruction means determining that the instructions for artificial intelligence inference include a vector instruction and responding to that vector instruction.
In some embodiments, the memory includes a vector data memory, an instruction memory, and a scalar data memory; the registers include vector registers and scalar registers.
In some embodiments, the step of loading the data from the memory into corresponding registers based on the instructions includes: determining the number of vectors for a single vector operation based on environment parameters, and loading that number of vector data into the vector registers according to a vector load instruction among the instructions.
In some embodiments, the RISC-V-based artificial intelligence inference method further includes the following step: determining, by the convolution control unit, the environment parameters based on a register configuration instruction among the instructions, the environment parameters including the effective vector bit width, the number of vector registers per group, the register bit width, and the number of vectors currently to be operated on;
the step of determining the number of vectors for a single vector operation based on the environment parameters includes: determining the maximum allowed number of operation vectors from the register bit width, the effective vector bit width, and the number of vector registers per group, and taking the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number for a single vector operation.
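As a minimal sketch, the rule above (maximum allowed vectors = register bit width divided by effective vector bit width, multiplied by the registers per group, then take the minimum with the number currently required) can be written as follows; the function and parameter names are ours, not the patent's:

```python
def single_vector_op_count(reg_width: int, sew: int, lmul: int, n_needed: int) -> int:
    """Number of vectors handled in one operation.

    reg_width: vector register bit width N (e.g. 128, 512, 1024)
    sew:       effective element bit width SEW
    lmul:      number of vector registers per group LMUL
    n_needed:  number of vectors currently to be operated on Nw
    """
    n_max = (reg_width // sew) * lmul   # maximum allowed, Ne = (N/SEW)*LMUL
    return min(n_max, n_needed)
```

For example, with N = 128, SEW = 8, and LMUL = 1, at most 16 vectors fit, so a request for 100 is handled 16 at a time.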
In some embodiments, the step of processing, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction includes:
in response to the vector register holding data, the shadow register of the vector processing unit being empty, and the convolution control unit permitting it, caching the vector data from the vector register into the shadow register;
performing reordering and then preprocessing on the vector data in the shadow register, and storing it into the multiplier input buffer of the vector processing unit;
obtaining, by the multiplier array of the vector processing unit, the vector data from the multiplier input buffer to perform product operations under the control of the convolution control unit;
obtaining, by the multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform accumulation operations under the control of the convolution control unit; and
obtaining, by the vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform nonlinear vector operations using a lookup table under the control of the convolution control unit.
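The processing steps above form a short pipeline. A toy software model follows; the stage names and the bypass set are our own simplification of the convolution control unit's role, and the reordering/preprocessing stages are omitted for brevity:

```python
def vector_pipeline(vec, weights, lut, enabled):
    """Toy model of the chain: multiply -> accumulate -> activate.
    Stages not listed in `enabled` are bypassed, mirroring the
    convolution control unit's ability to bypass unneeded stages."""
    data = list(vec)
    if "multiply" in enabled:                       # multiplier array
        data = [x * w for x, w in zip(data, weights)]
    if "accumulate" in enabled:                     # multiply-accumulator
        data = [sum(data)]
    if "activate" in enabled:                       # LUT-based activation
        data = [lut[min(max(int(x), 0), len(lut) - 1)] for x in data]
    return data
```

For instance, enabling only multiply and accumulate turns the pipeline into a plain dot product.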
In some embodiments, the RISC-V-based artificial intelligence inference method further includes the following steps: configuring, by the convolution control unit, the lookup-table buffer area of the vector activation unit based on a lookup-table activation instruction among the instructions; and
causing, by the convolution control unit, product operations, accumulation operations, or nonlinear vector operations to be selectively performed on the vector data based on real-time control instructions among the instructions.
In some embodiments, the RISC-V-based artificial intelligence inference method further includes the following steps:
in response to an instruction being a scalar instruction, processing the corresponding scalar data in an arithmetic/logic unit based on the scalar instruction; and
feeding back the processed scalar data to complete inference.
Here, responding to an instruction being a scalar instruction means determining that the instructions for artificial intelligence inference further include a scalar instruction and responding to that scalar instruction.
In some embodiments, a RISC-V-based artificial intelligence inference system is disclosed. The system includes a processor and a memory; the memory stores computer-readable instructions executable by the processor, and the computer-readable instructions, when executed by the processor, cause the processor to perform the following steps:
acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into memory;
fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
in response to an instruction being a vector instruction, processing, by a convolution control unit, the corresponding vector data in a vector processing unit based on the vector instruction; and
feeding back the processed vector data to complete inference.
Here, responding to an instruction being a vector instruction means determining that the instructions for artificial intelligence inference include a vector instruction and responding to that vector instruction.
In some embodiments, the memory includes a vector data memory, an instruction memory, and a scalar data memory; the registers include vector registers and scalar registers; the step of loading the data from the memory into corresponding registers based on the instructions includes: determining the number of vectors for a single vector operation based on environment parameters, and loading that number of vector data into the vector registers according to a vector load instruction among the instructions;
when executing the computer-readable instructions, the processor further performs the following steps: determining, by the convolution control unit, the environment parameters based on a register configuration instruction among the instructions, the environment parameters including the effective vector bit width, the number of vector registers per group, the register bit width, and the number of vectors currently to be operated on; determining the number of vectors for a single vector operation based on the environment parameters includes: dividing the register bit width by the effective vector bit width and multiplying by the number of vector registers per group to determine the maximum allowed number of operation vectors, and taking the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number for a single vector operation.
In some embodiments, the step of processing, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction includes: in response to the vector register holding data, the shadow register of the vector processing unit being empty, and the convolution control unit permitting it, caching the vector data from the vector register into the shadow register; performing reordering and then preprocessing on the vector data in the shadow register, and storing it into the multiplier input buffer of the vector processing unit; obtaining, by the multiplier array of the vector processing unit, the vector data from the multiplier input buffer to perform product operations under the control of the convolution control unit; obtaining, by the multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform accumulation operations under the control of the convolution control unit; and obtaining, by the vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform nonlinear vector operations using a lookup table under the control of the convolution control unit.
In some embodiments, one or more non-volatile computer-readable storage media storing computer-readable instructions are disclosed; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into memory;
fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
in response to an instruction being a vector instruction, processing, by a convolution control unit, the corresponding vector data in a vector processing unit based on the vector instruction; and
feeding back the processed vector data to complete inference.
Here, responding to an instruction being a vector instruction means determining that the instructions for artificial intelligence inference include a vector instruction and responding to that vector instruction.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Figure 1 is a schematic flowchart of the RISC-V-based artificial intelligence inference method in some embodiments of the present application;
Figure 2 is a block diagram of the RISC-V-based artificial intelligence inference method in some embodiments of the present application;
Figure 3 is an instruction fetch flowchart of the RISC-V-based artificial intelligence inference method in some embodiments of the present application;
Figure 4 is a structural diagram of the vector processing unit of the RISC-V-based artificial intelligence inference method in some embodiments of the present application;
Figure 5 is a vector reordering flowchart of the RISC-V-based artificial intelligence inference method in some embodiments of the present application;
Figure 6 is an overall flowchart of the vector processing unit of the RISC-V-based artificial intelligence inference method in some embodiments of the present application;
Figure 7 is a schematic diagram of the convolution control unit of the RISC-V-based artificial intelligence inference method in some embodiments of the present application;
Figure 8 is an on-chip interconnection structure diagram of the processors of the RISC-V-based artificial intelligence inference method in some embodiments of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are further described in detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present application are intended to distinguish two non-identical entities or parameters with the same name; thus, "first" and "second" are merely for convenience of expression and should not be construed as limiting the embodiments of the present application, and subsequent embodiments will not explain this again one by one.
Figure 1 is a schematic flowchart of the RISC-V-based artificial intelligence inference method in one embodiment.
As shown in Figure 1, the RISC-V-based artificial intelligence inference method includes the following steps:
Step S101: acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into memory;
Step S103: fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
Step S105: in response to an instruction being a vector instruction, processing, by a convolution control unit, the corresponding vector data in a vector processing unit based on the vector instruction; and
Step S107: feeding back the processed vector data to complete inference.
Here, the instructions in step S103 are the instructions for artificial intelligence inference. Responding to an instruction being a vector instruction means determining that the instructions for artificial intelligence inference include a vector instruction and responding to that vector instruction.
The present application discloses an AI chip architecture based on the RISC-V instruction set. It can perform convolution computation or matrix computation, and can also serve as an AI inference accelerator that assists a host processor with convolution/matrix computation. Because it is fully compatible with the RISC-V reduced instruction set, further development can be done on the RISC-V software toolchain, which greatly reduces the difficulty of developing the software toolchain. The core of this design is a convolution computing architecture based on the RISC-V instruction set; it can perform scalar operations, vector operations, convolution operations, matrix operations, nonlinear activation operations, and so on, and can satisfy all the computational needs of artificial intelligence inference. AI chips can also be interconnected through an on-chip mesh interconnection network (NoC) to form an architecture with greater computing power, meeting the computing-power requirements of different terminals.
A person of ordinary skill in the art can understand that all or part of the flows in the above embodiments can be completed by computer-readable instructions instructing the relevant hardware. The aforementioned computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed, they can implement the flow of the RISC-V-based artificial intelligence inference method. The aforementioned non-volatile computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the aforementioned computer program can achieve the same or similar effects as any of the corresponding method embodiments described above.
In some embodiments, the memory includes a vector data memory, an instruction memory, and a scalar data memory; the registers include vector registers and scalar registers.
In some embodiments, the step of loading the data from the memory into corresponding registers based on the instructions includes: determining the number of vectors for a single vector operation based on environment parameters, so as to use a vector load instruction among the instructions to load that number of vector data into the vector registers (that is, determining the number of vectors for a single vector operation based on the environment parameters, and loading that number of vector data into the vector registers according to the vector load instruction among the instructions).
In some embodiments, the RISC-V-based artificial intelligence inference method further includes the following step: determining, by the convolution control unit, the environment parameters based on a register configuration instruction among the instructions, the environment parameters including the effective vector bit width, the number of vector registers per group, the register bit width, and the number of vectors currently to be operated on;
the step of determining the number of vectors for a single vector operation based on the environment parameters includes: dividing the register bit width by the effective vector bit width and multiplying by the number of vector registers per group to determine the maximum allowed number of operation vectors, and taking the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number for a single vector operation.
In some embodiments, the step of processing, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction includes:
in response to the vector register holding data, the shadow register of the vector processing unit being empty, and the convolution control unit permitting it, caching the vector data from the vector register into the shadow register;
performing reordering and then preprocessing on the vector data in the shadow register, and storing it into the multiplier input buffer of the vector processing unit;
obtaining, by the multiplier array of the vector processing unit, the vector data from the multiplier input buffer to perform product operations under the control of the convolution control unit;
obtaining, by the multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform accumulation operations under the control of the convolution control unit; and
obtaining, by the vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform nonlinear vector operations using a lookup table under the control of the convolution control unit.
In some embodiments, the method further includes: configuring, by the convolution control unit, the lookup-table buffer area of the vector activation unit based on a lookup-table activation instruction among the instructions; and
causing, by the convolution control unit, the vector processing unit to selectively perform product operations, accumulation operations, or nonlinear vector operations based on real-time control instructions among the instructions.
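The lookup-table activation mentioned above can be illustrated with a table approximating a sigmoid. The table size, input range, and nearest-entry indexing below are our assumptions; the patent only specifies that parameters are loaded into the activation unit's buffer area by a data load instruction:

```python
import math

def build_sigmoid_lut(n_entries=256, x_min=-8.0, x_max=8.0):
    """Table of sigmoid samples, as might be loaded into the vector
    activation unit's lookup-table buffer (sizes are illustrative)."""
    step = (x_max - x_min) / (n_entries - 1)
    return [1.0 / (1.0 + math.exp(-(x_min + i * step))) for i in range(n_entries)]

def lut_activate(x, lut, x_min=-8.0, x_max=8.0):
    """Nonlinear activation by table lookup: clamp the input to the
    table's range, then index the nearest entry."""
    x = min(max(x, x_min), x_max)
    idx = round((x - x_min) / (x_max - x_min) * (len(lut) - 1))
    return lut[idx]
```

A finer table trades buffer space for approximation accuracy, which is presumably why the table is software-configurable.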
In some embodiments, the RISC-V-based artificial intelligence inference method further includes the following steps:
in response to an instruction being a scalar instruction, processing the corresponding scalar data in an arithmetic/logic unit based on the scalar instruction; and
feeding back the processed scalar data to complete inference.
Here, responding to an instruction being a scalar instruction means determining that the instructions for artificial intelligence inference further include a scalar instruction and responding to that scalar instruction.
Specific implementations of the present application are further explained below with reference to specific embodiments.
The top-level schematic of the architecture of the present application is shown in Figure 2. As shown in Figure 2, the present application includes a DMA (Direct Memory Access) interface, a scalar data storage unit, an instruction storage unit, a vector storage unit, instruction fetch, instruction decode, instruction prediction, 32 scalar registers, 32 vector registers, control registers, a scalar arithmetic logic unit (ALU), a vector processing unit, a convolution control unit, and a multiplier array.
The DMA interface is responsible for loading the instructions and data in DDR (double data rate SDRAM) into the corresponding storage units. The scalar/vector/instruction storage units are all tightly coupled memories; compared with a cache, such tightly coupled memory has low power consumption and fixed latency, and there are no cache misses, so it can satisfy the processor's real-time and reliability requirements.
The fetch/decode/predict unit is the unit through which the processor reads instructions from the instruction storage unit, decodes instructions, and performs branch prediction. Instruction fetch means reading, acquiring, or taking out an instruction. Referring to the fetch flow shown in Figure 3 (the instruction fetch flowchart of the RISC-V-based artificial intelligence inference method), after fetching, the processor determines whether the fetched instruction is a vector instruction or a scalar instruction, and then performs the vector/scalar data load, execute, and write-back operations. The fetch/decode/predict unit also has an instruction prediction function: after each instruction is fetched, the address of the next instruction is generated; if the pre-decode unit determines that the current instruction is a branch instruction, the processor recomputes the address of the next instruction and fetches it in reserve.
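The fetch-then-classify flow of Figure 3 can be roughly illustrated by routing an instruction to the vector or scalar path by its mnemonic. The mnemonic set below is a small invented sample for illustration; a real decoder inspects opcode bits, not strings:

```python
# Hypothetical sample of vector mnemonics; real routing decodes opcode fields.
VECTOR_MNEMONICS = {"vsetvli", "vle8.v", "vle16.v", "vle32.v", "vse32.v", "vmacc.vv"}

def dispatch(mnemonic: str) -> str:
    """Route a fetched instruction to the vector path or the scalar path,
    after which load/execute/write-back proceeds on that path."""
    return "vector" if mnemonic in VECTOR_MNEMONICS else "scalar"
```

Everything not recognized as a vector instruction falls through to the scalar path.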
The vector registers/scalar registers are the registers specified in the RISC-V architecture. The scalar registers are 32 32-bit registers; the vector registers are 32 vector registers with customizable bit width. The scalar registers serve to cache function-call return addresses, stack pointers, temporary variables, function arguments or return values, and so on. The vector registers are used to cache vector data variables, mask data, intermediate computation results, and so on. In this processor, the vector functional units (such as the vector processing unit) and the scalar functional units (such as the scalar arithmetic logic unit) share the processor's configuration registers and status registers.
Among these, the scalar arithmetic logic unit (ALU) performs scalar arithmetic/logic operations. The vector processing unit mainly performs vector operations other than convolution/product, such as matrix transposition, reshaping, nonlinear operations, and vector accumulation. The convolution control unit is responsible for vector instruction decoding, module register configuration, caching of the nonlinear-function lookup table, vector logic control, and so on. The multiplier array unit mainly performs the convolution and matrix multiplication functions; it integrates 8 multiplier modules (the number can be customized), and each module integrates 64 8-bit multipliers (the number of multipliers can be customized according to the architecture).
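With the default configuration described above (8 multiplier modules of 64 eight-bit multipliers each), the array's peak throughput works out as follows; the helper simply makes that arithmetic explicit:

```python
def multiplier_array_peak(n_modules: int = 8, mults_per_module: int = 64) -> int:
    """Peak number of 8-bit products the array can start per cycle:
    modules x multipliers-per-module (both customizable in the design)."""
    return n_modules * mults_per_module
```

So the default array starts 8 × 64 = 512 eight-bit products per cycle.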
The structure of the vector processing unit is detailed in Figure 4. The bit width N of a vector register is the register length fixed at hardware design time; for example, N may be 128, 512, or 1024. The data processing flow of the vector processing unit is as follows:
1) Use the instruction vsetvli (the RISC-V vector-length setting instruction) to set the effective vector bit width SEW, the number LMUL of vector registers per group, the number Nw of vectors currently to be operated on, and the maximum allowed number of operation vectors Ne. Due to the vector-register constraint, Ne is (N/SEW)*LMUL. If Nw > Ne, the number of vectors per operation is Ne; if Nw is less than or equal to Ne, the number of vectors per operation is Nw.
2) Use a vector load instruction to load vector data into the vector registers; the vector load instruction may be vlen.v (n = 8, 16, 32, …).
3) The vector data is cached in the shadow registers (also called vector shadow registers). The shadow registers cache the data in the vector registers and are jointly controlled by the vector load instruction, the convolution control, and their own state. Only when the vector registers hold new data, the convolution is allowed to run, and the shadow registers are empty is the data in the vector registers allowed to be loaded into the shadow registers.
4) According to the configuration of the convolution control unit, the vectors in the shadow registers are reordered in the vector data reordering unit, converting a two-dimensional matrix into a three-dimensional matrix in the manner shown in Figure 5; following the principle of matrix multiplication, this increases the reuse rate of the matrix data. The vector data reordering unit also provides functions such as matrix transposition and is responsible for arranging the data so that the multipliers can consume it directly.
5) The vector preprocessing unit is responsible for scaling, offsetting, and truncating the vectors.
6) The processed data enters the multiplier input buffer. The purpose of the buffer is as follows: in the matrix multiplication shown in the figure above, the row data of the first matrix must be multiplied by all the column data of the second matrix to obtain the row data of the output matrix, and much of the data in this process can be reused, so an input buffer unit is provided to cache the reused data. This mechanism can effectively reduce the number of DDR accesses by the processor and lower the processor's demand for DDR bandwidth.
7) The buffered data enters the multiplier array for product operations, producing the product data. The multiplier array is sized according to the maximum amount of data that the vector registers can load, so it can meet the peak data-throughput demand without redundancy, making full use of the hardware resources without waste.
8) The product data is fed into the multiply-accumulator for accumulation operations. The multiply-accumulator contains an accumulation buffer that stores intermediate results; only when the accumulated value is the final output result is it output to the next unit.
9) The vector activation unit is responsible for nonlinear vector operations, using the lookup-table method. The lookup table is configured by loading parameters into the buffer area of the vector activation unit through a data load instruction.
10) The final computation result is stored to the corresponding location through the vector registers under the control of a vector store instruction. The entire computation process is controlled by the convolution control unit, and any unneeded stage can be set to bypass. The whole flow is shown completely in Figure 6.
The structure of the convolution control unit is detailed in Figure 7. The convolution control unit controls the entire convolution/matrix-multiplication process according to the vector instructions. The vector instructions sent by the fetch-and-decode unit are decoded in the convolution control unit.
1) If the vector instruction is a convolution-control-register configuration command, the configuration registers are written. The configuration registers are mapped into the processor's memory space; when a vector instruction addresses this address segment, it is treated as a register-configuration command, and the registers are distinguished by address. The configuration registers include the registers of the different unit modules, such as the data size, data input dimension, data output dimension, and transpose-or-not registers of the vector data reordering unit, and the data scaling coefficient, data offset coefficient, and truncation coefficient registers of the vector preprocessing unit.
2) If the vector instruction configures the lookup table of the vector activation unit, the data is written directly into the lookup-table buffer area of the vector activation unit.
3) If it is a convolution real-time control instruction, control is exercised in real time according to the instruction, for example data multiply, data load, and data store instructions.
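The three decode cases 1)-3) can be sketched as address-window routing. The address values below are invented for illustration; the patent only states that the configuration registers are mapped into the processor's memory space and distinguished by address:

```python
# Hypothetical address windows; the real map is implementation-defined.
CONFIG_BASE, CONFIG_TOP = 0x4000_0000, 0x4000_0FFF   # configuration registers
LUT_BASE, LUT_TOP = 0x4000_1000, 0x4000_1FFF         # activation-unit lookup table

def decode_vector_instruction(addr: int) -> str:
    """Route a decoded vector instruction by its target address, as in
    cases 1)-3) above: configuration-register write, lookup-table load,
    or (by default) real-time control such as multiply/load/store."""
    if CONFIG_BASE <= addr <= CONFIG_TOP:
        return "configure_register"
    if LUT_BASE <= addr <= LUT_TOP:
        return "load_lookup_table"
    return "real_time_control"
```

In the sketch, the default case stands in for case 3), where the instruction itself (not an address) selects the real-time action.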
In addition, the processor of the present application can interconnect multiple processors through an on-chip mesh interconnection network (NoC) to form an architecture with greater computing power, such as an artificial intelligence processor (AIPU), whose structure is detailed in Figure 8. The number of processors can be large or small, which is very flexible and suits applications of different hardware scales. The processor architecture interconnected through the network supports single-instruction-multiple-data (SIMD) or multiple-instruction-multiple-data (MIMD) modes, making software programming more flexible.
As can be seen from the above embodiments, the RISC-V-based artificial intelligence inference method acquires instructions and data for artificial intelligence inference through a direct memory access interface and writes them into memory; fetches and decodes the instructions from the memory, and loads the data from the memory into corresponding registers based on the instructions; in response to an instruction being a vector instruction (that is, it determines that the instructions for artificial intelligence inference include a vector instruction and responds to that vector instruction), processes, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction; and feeds back the processed vector data to complete inference. This technical solution can apply the RISC-V instruction set to the inference computation of AI chips, facilitating the application and deployment of artificial intelligence inference.
It should be particularly pointed out that the steps in the embodiments of the above RISC-V-based artificial intelligence inference method can be interleaved, replaced, added, and deleted with respect to one another; therefore, these reasonable permutations and combinations of the RISC-V-based artificial intelligence inference method should also belong to the protection scope of the present application, and the protection scope of the present application should not be limited to the foregoing embodiments.
In some embodiments, the present application discloses a RISC-V-based artificial intelligence inference system that facilitates the application and deployment of artificial intelligence inference. The system includes a processor and a memory; the memory stores computer-readable instructions executable by the processor, and the computer-readable instructions, when executed by the processor, cause the processor to perform the following steps:
acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into memory;
fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
in response to an instruction being a vector instruction, processing, by a convolution control unit, the corresponding vector data in a vector processing unit based on the vector instruction; and
feeding back the processed vector data to complete inference.
Here, responding to an instruction being a vector instruction means determining that the instructions for artificial intelligence inference include a vector instruction and responding to that vector instruction.
In some embodiments, the memory includes a vector data memory, an instruction memory, and a scalar data memory; the registers include vector registers and scalar registers; when executed by the processor, the computer-readable instructions implement the step of loading the data from the memory into corresponding registers based on the instructions, including: determining the number of vectors for a single vector operation based on environment parameters, so as to use a vector load instruction among the instructions to load that number of vector data into the vector registers (that is, determining the number of vectors for a single vector operation based on the environment parameters, and loading that number of vector data into the vector registers according to the vector load instruction among the instructions).
When executed by the processor, the computer-readable instructions further implement the following steps: determining, by the convolution control unit, the environment parameters based on a register configuration instruction among the instructions, the environment parameters including the effective vector bit width, the number of vector registers per group, the register bit width, and the number of vectors currently to be operated on; when executed by the processor, the computer-readable instructions implement the step of determining the number of vectors for a single vector operation based on the environment parameters, including: dividing the register bit width by the effective vector bit width and multiplying by the number of vector registers per group to determine the maximum allowed number of operation vectors, and taking the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number for a single vector operation.
In some embodiments, when executed by the processor, the computer-readable instructions implement the step of processing, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction, including: in response to the vector register holding data, the shadow register of the vector processing unit being empty, and the convolution control unit permitting it, caching the vector data from the vector register into the shadow register; performing reordering and then preprocessing on the vector data in the shadow register, and storing it into the multiplier input buffer of the vector processing unit; obtaining, by the multiplier array of the vector processing unit, the vector data from the multiplier input buffer to perform product operations under the control of the convolution control unit; obtaining, by the multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform accumulation operations under the control of the convolution control unit; and obtaining, by the vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform nonlinear vector operations using a lookup table under the control of the convolution control unit.
In some embodiments, when executed by the processor, the computer-readable instructions further implement the following steps:
configuring, by the convolution control unit, the lookup-table buffer area of the vector activation unit based on a lookup-table activation instruction among the instructions; and
causing, by the convolution control unit, product operations, accumulation operations, or nonlinear vector operations to be selectively performed on the vector data based on real-time control instructions among the instructions.
In some embodiments, when executed by the processor, the computer-readable instructions further implement the following steps:
in response to an instruction being a scalar instruction, processing the corresponding scalar data in an arithmetic/logic unit based on the scalar instruction; and
feeding back the processed scalar data to complete inference.
Here, responding to an instruction being a vector instruction means determining that the instructions for artificial intelligence inference include a vector instruction and responding to that vector instruction; responding to an instruction being a scalar instruction means determining that the instructions for artificial intelligence inference further include a scalar instruction and responding to that scalar instruction.
As can be seen from the above embodiments, the system provided by the embodiments of the present application acquires instructions and data for artificial intelligence inference through a direct memory access interface and writes them into memory; fetches and decodes the instructions from the memory, and loads the data from the memory into corresponding registers based on the instructions; in response to an instruction being a vector instruction, processes, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction; and feeds back the processed vector data to complete inference. This technical solution can apply the RISC-V instruction set to the inference computation of AI chips, facilitating the application and deployment of artificial intelligence inference.
It should be particularly pointed out that the above embodiments of the RISC-V-based artificial intelligence inference system use the embodiments of the RISC-V-based artificial intelligence inference method to explain the working process of each module in detail, and those skilled in the art can easily conceive of applying these modules to other embodiments of the RISC-V-based artificial intelligence inference method. Of course, since the steps in the embodiments of the RISC-V-based artificial intelligence inference method can be interleaved, replaced, added, and deleted with respect to one another, these reasonable permutations and combinations should also belong to the protection scope of the present application for the RISC-V-based artificial intelligence inference system, and the protection scope of the present application should not be limited to the foregoing embodiments.
In some embodiments, one or more non-volatile computer-readable storage media storing computer-readable instructions are disclosed; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into memory;
fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
in response to an instruction being a vector instruction, processing, by a convolution control unit, the corresponding vector data in a vector processing unit based on the vector instruction; and
feeding back the processed vector data to complete inference.
Here, responding to an instruction being a vector instruction means determining that the instructions for artificial intelligence inference include a vector instruction and responding to that vector instruction.
In some embodiments, when executed by the processor, the computer-readable instructions implement the step of loading the data from the memory into corresponding registers based on the instructions, including: determining the number of vectors for a single vector operation based on environment parameters, so as to use a vector load instruction among the instructions to load that number of vector data into the vector registers (that is, determining the number of vectors for a single vector operation based on the environment parameters, and loading that number of vector data into the vector registers according to the vector load instruction among the instructions).
In some embodiments, when executed by the processor, the computer-readable instructions further implement the following step: determining, by the convolution control unit, the environment parameters based on a register configuration instruction among the instructions, the environment parameters including the effective vector bit width, the number of vector registers per group, the register bit width, and the number of vectors currently to be operated on.
In some embodiments, when executed by the processor, the computer-readable instructions implement the step of determining the number of vectors for a single vector operation based on the environment parameters, including: dividing the register bit width by the effective vector bit width and multiplying by the number of vector registers per group to determine the maximum allowed number of operation vectors, and taking the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number for a single vector operation.
In some embodiments, when executed by the processor, the computer-readable instructions implement the step of processing, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction, including:
in response to the vector register holding data, the shadow register of the vector processing unit being empty, and the convolution control unit permitting it, caching the vector data from the vector register into the shadow register;
performing reordering and then preprocessing on the vector data in the shadow register, and storing it into the multiplier input buffer of the vector processing unit;
obtaining, by the multiplier array of the vector processing unit, the vector data from the multiplier input buffer to perform product operations under the control of the convolution control unit;
obtaining, by the multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform accumulation operations under the control of the convolution control unit; and
obtaining, by the vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform nonlinear vector operations using a lookup table under the control of the convolution control unit.
In some embodiments, when executed by the processor, the computer-readable instructions further implement the following steps:
configuring, by the convolution control unit, the lookup-table buffer area of the vector activation unit based on a lookup-table activation instruction among the instructions; and
causing, by the convolution control unit, product operations, accumulation operations, or nonlinear vector operations to be selectively performed on the vector data based on real-time control instructions among the instructions.
In some embodiments, when executed by the processor, the computer-readable instructions further implement the following steps:
in response to an instruction being a scalar instruction, processing the corresponding scalar data in an arithmetic/logic unit based on the scalar instruction; and
feeding back the processed scalar data to complete inference. Here, responding to an instruction being a scalar instruction means determining that the instructions for artificial intelligence inference further include a scalar instruction and responding to that scalar instruction.
The above are exemplary embodiments disclosed by the present application, but it should be noted that various changes and modifications can be made without departing from the scope of the embodiments disclosed in the present application as defined by the claims. The functions, steps, and/or actions of the method claims according to the disclosed embodiments described here need not be performed in any particular order. In addition, although the elements disclosed in the embodiments of the present application may be described or claimed in the singular, they may also be understood as plural unless explicitly limited to the singular.
A person of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope disclosed by the embodiments of the present application (including the claims) is limited to these examples. Under the idea of the embodiments of the present application, the technical features of the above embodiments or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the present application as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, and the like made within the spirit and principles of the embodiments of the present application shall be included within the protection scope of the embodiments of the present application.

Claims (11)

  1. A RISC-V-based artificial intelligence inference method, comprising the following steps:
    acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into a memory;
    fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
    in response to an instruction being a vector instruction, processing, by a convolution control unit, corresponding vector data in a vector processing unit based on the vector instruction; and
    feeding back the processed vector data to complete inference.
  2. The method according to claim 1, wherein the memory comprises a vector data memory, an instruction memory, and a scalar data memory; and the registers comprise vector registers and scalar registers.
  3. The method according to claim 2, wherein the step of loading the data from the memory into corresponding registers based on the instructions comprises: determining a number of vectors for a single vector operation based on environment parameters, and loading that number of the vector data into the vector registers according to a vector load instruction among the instructions.
  4. The method according to claim 3, further comprising the following step: determining, by the convolution control unit, the environment parameters based on a register configuration instruction among the instructions, the environment parameters comprising an effective vector bit width, a number of vector registers per group, a register bit width, and a number of vectors currently to be operated on;
    wherein the step of determining the number of vectors for a single vector operation based on the environment parameters comprises: determining a maximum allowed number of operation vectors from the register bit width, the effective vector bit width, and the number of vector registers per group, and taking the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number of vectors for a single vector operation.
  5. The method according to claim 1, wherein the step of processing, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction comprises:
    in response to the vector register holding data, a shadow register of the vector processing unit being empty, and the convolution control unit permitting it, caching the vector data from the vector register into the shadow register;
    performing reordering and then preprocessing on the vector data in the shadow register, and storing it into a multiplier input buffer of the vector processing unit;
    obtaining, by a multiplier array of the vector processing unit, the vector data from the multiplier input buffer to perform product operations under the control of the convolution control unit;
    obtaining, by a multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform accumulation operations under the control of the convolution control unit; and
    obtaining, by a vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform nonlinear vector operations using a lookup table under the control of the convolution control unit.
  6. The method according to claim 5, further comprising the following steps:
    configuring, by the convolution control unit, a buffer area of the lookup table for the vector activation unit based on a lookup-table activation instruction among the instructions; and
    causing, by the convolution control unit, product operations, accumulation operations, or nonlinear vector operations to be selectively performed on the vector data based on real-time control instructions among the instructions.
  7. The method according to claim 1, further comprising the following steps:
    in response to an instruction being a scalar instruction, processing corresponding scalar data in an arithmetic/logic unit based on the scalar instruction; and
    feeding back the processed scalar data to complete inference.
  8. A RISC-V-based artificial intelligence inference system, comprising a processor and a memory;
    wherein the memory stores computer-readable instructions executable by the processor, and the computer-readable instructions, when executed by the processor, cause the processor to perform the following steps:
    acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into a memory;
    fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
    in response to an instruction being a vector instruction, processing, by a convolution control unit, corresponding vector data in a vector processing unit based on the vector instruction; and
    feeding back the processed vector data to complete inference.
  9. The system according to claim 8, wherein the memory comprises a vector data memory, an instruction memory, and a scalar data memory; the registers comprise vector registers and scalar registers; and the loading of the data from the memory into corresponding registers based on the instructions comprises: determining a number of vectors for a single vector operation based on environment parameters, and loading that number of the vector data into the vector registers according to a vector load instruction among the instructions;
    when executing the computer-readable instructions, the processor further performs the following step: determining, by the convolution control unit, the environment parameters based on a register configuration instruction among the instructions, the environment parameters comprising an effective vector bit width, a number of vector registers per group, a register bit width, and a number of vectors currently to be operated on;
    the step of determining the number of vectors for a single vector operation based on the environment parameters comprises: determining a maximum allowed number of operation vectors from the register bit width, the effective vector bit width, and the number of vector registers per group, and taking the smaller of the maximum allowed number of operation vectors and the number of vectors currently to be operated on as the number of vectors for a single vector operation.
  10. The system according to claim 8, wherein the step of processing, by the convolution control unit, the corresponding vector data in the vector processing unit based on the vector instruction comprises: in response to the vector register holding data, a shadow register of the vector processing unit being empty, and the convolution control unit permitting it, caching the vector data from the vector register into the shadow register; performing reordering and then preprocessing on the vector data in the shadow register, and storing it into a multiplier input buffer of the vector processing unit; obtaining, by a multiplier array of the vector processing unit, the vector data from the multiplier input buffer to perform product operations under the control of the convolution control unit; obtaining, by a multiply-accumulator of the vector processing unit, the vector data from the multiplier array to perform accumulation operations under the control of the convolution control unit; and obtaining, by a vector activation unit of the vector processing unit, the vector data from the multiply-accumulator to perform nonlinear vector operations using a lookup table under the control of the convolution control unit.
  11. One or more non-volatile computer-readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring instructions and data for artificial intelligence inference through a direct memory access interface and writing them into a memory;
    fetching and decoding the instructions from the memory, and loading the data from the memory into corresponding registers based on the instructions;
    in response to an instruction being a vector instruction, processing, by a convolution control unit, corresponding vector data in a vector processing unit based on the vector instruction; and
    feeding back the processed vector data to complete inference.
PCT/CN2021/122287 2020-12-24 2021-09-30 RISC-V-based artificial intelligence inference method and system WO2022134729A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/246,662 US11880684B2 (en) 2020-12-24 2021-09-30 RISC-V-based artificial intelligence inference method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011554149.6 2020-12-24
CN202011554149.6A CN112633505B (zh) 2020-12-24 2020-12-24 RISC-V-based artificial intelligence inference method and system

Publications (1)

Publication Number Publication Date
WO2022134729A1 true WO2022134729A1 (zh) 2022-06-30

Family

ID=75324611

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122287 WO2022134729A1 (zh) 2020-12-24 2021-09-30 一种基于risc-v的人工智能推理方法和系统

Country Status (3)

Country Link
US (1) US11880684B2 (zh)
CN (1) CN112633505B (zh)
WO (1) WO2022134729A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633505B (zh) 2020-12-24 2022-05-27 苏州浪潮智能科技有限公司 RISC-V-based artificial intelligence inference method and system
CN113642722A (zh) * 2021-07-15 2021-11-12 深圳供电局有限公司 Chip for convolution computation, control method therefor, and electronic device
CN114116513B (zh) * 2021-12-03 2022-07-29 中国人民解放军战略支援部队信息工程大学 Register mapping method and device from multiple instruction set architectures to the RISC-V instruction set architecture
CN115576606B (zh) * 2022-11-16 2023-03-21 苏州浪潮智能科技有限公司 Method for implementing matrix transpose multiplication, coprocessor, server, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017124648A1 (zh) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Vector computing device
CN110007961A (zh) * 2019-02-01 2019-07-12 中山大学 RISC-V-based edge computing hardware architecture
CN111078287A (zh) * 2019-11-08 2020-04-28 苏州浪潮智能科技有限公司 Vector operation co-processing method and apparatus
CN111158756A (zh) * 2019-12-31 2020-05-15 百度在线网络技术(北京)有限公司 Method and apparatus for processing information
CN112633505A (zh) * 2020-12-24 2021-04-09 苏州浪潮智能科技有限公司 RISC-V-based artificial intelligence inference method and system

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030221086A1 (en) * 2002-02-13 2003-11-27 Simovich Slobodan A. Configurable stream processor apparatus and methods
US8516026B2 (en) * 2003-03-10 2013-08-20 Broadcom Corporation SIMD supporting filtering in a video decoding system
GB2489914B (en) * 2011-04-04 2019-12-18 Advanced Risc Mach Ltd A data processing apparatus and method for performing vector operations
US9342284B2 (en) * 2013-09-27 2016-05-17 Intel Corporation Optimization of instructions to reduce memory access violations
US11106976B2 (en) * 2017-08-19 2021-08-31 Wave Computing, Inc. Neural network output layer for machine learning
US11803377B2 (en) * 2017-09-08 2023-10-31 Oracle International Corporation Efficient direct convolution using SIMD instructions
US20200174707A1 (en) * 2017-10-27 2020-06-04 Wave Computing, Inc. Fifo filling logic for tensor calculation
CN108958801B (zh) * 2017-10-30 2021-06-25 上海寒武纪信息科技有限公司 Neural network processor and method for executing a vector-maximum instruction using the processor
US11113397B2 (en) * 2019-05-16 2021-09-07 Cisco Technology, Inc. Detection of malicious executable files using hierarchical models
CN114450699A (zh) * 2019-09-24 2022-05-06 阿里巴巴集团控股有限公司 Method implemented by a processing unit, readable storage medium, and processing unit
CN111027690B (zh) * 2019-11-26 2023-08-04 陈子祺 Combined processing apparatus, chip, and method for performing deterministic inference


Also Published As

Publication number Publication date
US20230367593A1 (en) 2023-11-16
US11880684B2 (en) 2024-01-23
CN112633505B (zh) 2022-05-27
CN112633505A (zh) 2021-04-09

Similar Documents

Publication Publication Date Title
WO2022134729A1 (zh) RISC-V-based artificial intelligence inference method and system
US20200341758A1 (en) Convolutional Neural Network Hardware Acceleration Device, Convolutional Calculation Method, and Storage Medium
CN109240746B (zh) Apparatus and method for performing matrix multiplication operations
TWI789358B (zh) Computing unit, method, and apparatus supporting operation data of different bit widths
WO2022170997A1 (zh) Method, system, device, and medium for data processing based on the RISC-V instruction set
CN107766079B (zh) Processor and method for executing instructions on a processor
JP2014501009A (ja) Method and apparatus for moving data
CN111078287B (zh) Vector operation co-processing method and apparatus
CN111860773B (zh) Processing apparatus and method for information processing
US11403104B2 (en) Neural network processor, chip and electronic device
WO2021115208A1 (zh) Neural network processor, chip, and electronic device
CN112667289B (zh) CNN inference acceleration system, acceleration method, and medium
WO2022142479A1 (zh) Hardware accelerator, data processing method, system-on-chip, and medium
US20060265571A1 (en) Processor with different types of control units for jointly used resources
CN110908716A (zh) Implementation method for a vector-gather load instruction
WO2021115149A1 (zh) Neural network processor, chip, and electronic device
KR102635978B1 (ko) Mixed-precision MAC tree structure for maximizing memory-bandwidth usage to accelerate computation of generative large language models
US10127040B2 (en) Processor and method for executing memory access and computing instructions for host matrix operations
US20100281234A1 (en) Interleaved multi-threaded vector processor
JP7250953B2 (ja) Data processing device and artificial intelligence chip
US20210150311A1 (en) Data layout conscious processing in memory architecture for executing neural network model
US20100281236A1 (en) Apparatus and method for transferring data within a vector processor
JPH01255036A (ja) Microprocessor
US20090287860A1 (en) Programmable Direct Memory Access Controller
CN114418077A (zh) Method, system, device, and storage medium for accelerating neural network computation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908730

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21908730

Country of ref document: EP

Kind code of ref document: A1