WO2022170997A1 - Data processing method and system based on RISC-V instruction set, and device and medium - Google Patents

Data processing method and system based on RISC-V instruction set, and device and medium

Info

Publication number
WO2022170997A1
WO2022170997A1 (PCT/CN2022/074414)
Authority
WO
WIPO (PCT)
Prior art keywords
data
instruction
cache
coefficient
vector
Prior art date
Application number
PCT/CN2022/074414
Other languages
French (fr)
Chinese (zh)
Inventor
贾兆荣 (Jia Zhaorong)
Original Assignee
Shandong Yingxin Computer Technology Co., Ltd. (山东英信计算机技术有限公司)
Priority date
Filing date
Publication date
Application filed by Shandong Yingxin Computer Technology Co., Ltd.
Publication of WO2022170997A1 publication Critical patent/WO2022170997A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047Prefetch instructions; cache control instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3005Arrangements for executing specific machine instructions to perform operations for flow control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data processing, and more particularly, to a method, system, computer device and readable medium for data processing based on the RISC-V instruction set.
  • the value of data lies in analysis and utilization, not simple storage.
  • the amount of data is constantly growing, and it is impossible to transmit all of it to the cloud through the network, since bandwidth grows more slowly than data.
  • for application scenarios with strict real-time requirements, such as autonomous driving and unmanned driving, the data needs to be judged at the edge.
  • for scenarios with high privacy protection requirements, such as medical information or data that users are unwilling to share in the cloud, the data needs to be stored locally.
  • most of the data generated by security equipment is useless or has no potential to be tapped, and transmitting all of it to the cloud is a waste of bandwidth.
  • if intelligent analysis is performed at the edge and only useful or potentially useful data is transmitted to the cloud, network bandwidth is greatly saved. The transfer of data processing from the cloud to the edge is therefore an inevitable trend, and edge-side AI (artificial intelligence) chips are the general trend as well.
  • artificial intelligence processing at the edge requires AI chips, and the challenges faced by AI chips are mainly computing power and computing efficiency.
  • the computing power of an AI chip is determined by the number of on-chip computing units. Since the amount of data involved in AI computing is very large, in theory, the larger the computing power of an AI chip, the better, but in fact, the computing power of an AI chip is restricted by various factors:
  • on-chip storage bandwidth and bus bandwidth: the main contradiction in AI chips is between storage bandwidth and computing power. The greater the computing power, the greater the amount of input data, intermediate results and output data, and the higher the required storage bandwidth. Current storage bandwidth falls far short of computing-power requirements, and if the computing units and storage units cannot be arranged reasonably, the result is a chip with large computing power but low efficiency.
  • a deep neural network model usually consists of multiple layers, and the output of the previous layer is the input of the next layer; in the same layer, the result of the multiplication and addition operation is often the input of activation, pooling, and normalization. Therefore, if multi-threading/parallel computing/computation pipeline cannot be implemented reasonably, the calculation of the previous step will hinder the calculation of the next step, causing waste of resources and reducing computing efficiency.
  • as mentioned in point 2, the operators involved in AI are varied, but the AI chip is fixed. Making unchanging hardware handle variable operators efficiently requires software that can reasonably allocate hardware resources according to the hardware architecture and compile efficient machine code; at the same time, the AI chip is required to have efficient control capabilities.
  • the purpose of the embodiments of the present application is to propose a method, system, computer device and computer-readable storage medium for data processing based on the RISC-V instruction set, in which an AIPU (AI process unit, artificial intelligence processing unit) and the CPU share memory, making computation and storage adjacent, improving memory access bandwidth, facilitating data interaction between the AIPU and the CPU, reducing the amount of data exchanged over external buses, and reducing the demand for bus bandwidth.
  • the AIPU and the CPU each have a small internal buffer (cache) used to hold input data, intermediate results, output data and the CPU's prefetched instructions, allowing data to be loaded while computation proceeds, extending the time available for data reads and writes, and further reducing the demand for bus bandwidth.
  • an aspect of the embodiments of the present application provides a method for data processing based on a RISC-V instruction set, including the following steps: acquiring an instruction from the RISC-V instruction space, caching it in the cache, and judging the type of the instruction; in response to the instruction being a branch jump instruction, regenerating the instruction address and jumping to the corresponding branch according to the instruction address; in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and performing a convolution operation according to the corresponding feature data and coefficient data, and activating, normalizing and pooling the result of the operation, as the sketch below illustrates.
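  • A minimal, illustrative sketch of this control flow follows. It is not the patent's implementation: every name (run, convolve, the cache objects) and every data shape is an assumption made for demonstration only.

```python
# Toy model of the four steps S1-S4 described above; all APIs are hypothetical.

def activate(xs):                          # e.g. ReLU
    return [max(x, 0.0) for x in xs]

def convolve(features, coefficients):      # one multiply-add: products accumulated
    return sum(f * c for f, c in zip(features, coefficients))

def run(inst, l1_feat, l1_coef, l2_feat, l2_coef):
    if inst["type"] != "branch_jump":      # S1: type already judged at fetch
        return "general branch"
    if inst["target"] == "aipu":           # S2: regenerated address selects a branch
        out = convolve(l1_feat, l1_coef)   # S3/S4: level 1 feeds the current op...
        l1_feat[:], l1_coef[:] = l2_feat, l2_coef  # ...level 2 holds the next op
        return activate([out])
    return "vector branch"                 # vector architecture branch

print(run({"type": "branch_jump", "target": "aipu"},
          [1.0, 2.0], [0.5, -0.25], [3.0, 4.0], [0.1, 0.2]))
```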
  • the method further includes performing a vector operation according to the instruction in response to the jump to the vector architecture branch.
  • the method further includes: in response to the instruction being a load or store instruction, reading data from the address in the storage space into the destination operand according to the address in the source operand.
  • the method further includes: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write at the same time.
  • the method further includes: configuring the register file in the AIPU branch into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation.
  • the method further includes: according to the requirements of the operation, reading the data corresponding to the operation, performing dimension transformation on it, and writing the converted data into the corresponding coefficient cache or input feature cache.
  • the method further includes: in response to performing the convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling writing to the caches.
  • another aspect of the embodiments of the present application provides a system for data processing based on a RISC-V instruction set, including: an acquisition module, configured to acquire an instruction from the RISC-V instruction space, cache it in the cache, and determine the type of the instruction; a jump module, configured to regenerate the instruction address in response to the instruction being a branch jump instruction, and to jump to the corresponding branch according to the instruction address; an AIPU module, configured to, in response to a jump to the AIPU branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and an execution module, configured to perform a convolution operation according to the corresponding feature data and coefficient data, and activate, normalize and pool the results of the operation.
  • a computer device, including: at least one processor; and a memory, where the memory stores computer instructions that can be executed on the processor, and the instructions, when executed by the processor, implement the steps of the above method.
  • a computer-readable storage medium stores a computer program that implements the above method steps when executed by a processor.
  • the AIPU (AI process unit, artificial intelligence processing unit) shares memory with the CPU, and the AIPU establishes the first-level input feature cache, the first-level coefficient cache, the second-level input feature cache and the second-level coefficient cache in the shared memory. This makes computation and storage adjacent, improves memory access bandwidth, facilitates data interaction between the AIPU and the CPU, reduces the amount of data exchanged over external buses, and reduces the demand for bus bandwidth.
  • there is a small buffer inside each of the AIPU and the CPU to cache input data, intermediate results, output data and the CPU's prefetched instructions, allowing data to be loaded while computation proceeds, extending the time available for data reads and writes, and further reducing the demand for bus bandwidth.
  • FIG. 1 is a schematic diagram of an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application
  • FIG. 2 is a schematic diagram of a CPU architecture in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an AIPU architecture provided by the present application.
  • FIG. 4 is a schematic diagram of a convolution operation in an embodiment of the method for data processing based on the RISC-V instruction set provided by the present application;
  • FIG. 5 is a schematic diagram of the hardware structure of an embodiment of a computer device for data processing based on a RISC-V instruction set provided by the present application;
  • FIG. 6 is a schematic diagram of an embodiment of a computer storage medium for data processing based on a RISC-V instruction set provided by the present application.
  • FIG. 1 shows a schematic diagram of an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application.
  • the embodiment of the present application includes the following steps:
  • a storage-computing integrated structure is adopted, and the AIPU and the CPU share memory, wherein the AIPU establishes the first-level input feature cache, the first-level coefficient cache, the second-level input feature cache and the second-level coefficient cache in the shared memory. This makes computation adjacent to storage, improves memory access bandwidth, facilitates data interaction between the AIPU and the CPU, reduces the amount of data exchanged over external buses, and reduces the demand for bus bandwidth.
  • there is a small buffer inside each of the AIPU and the CPU to cache input data, intermediate results, output data and the CPU's prefetched instructions, allowing data to be loaded while computation proceeds, extending the time available for data reads and writes, and further reducing the demand for bus bandwidth.
  • the RISC-V instruction set includes a general instruction set and a vector extension instruction set, and can be divided into: the integer instruction set I, the integer multiplication and division instruction set M, the atomic operation instruction set A, the single-precision instruction set F, the double-precision instruction set D, the compressed instruction set C, and the vector instruction set V.
  • the arithmetic logic operation unit executes the operations of the IMAFDC instruction sets;
  • the vector operation unit executes the operations of the vector instruction set V.
  • the CPU architecture is designed according to the RISC-V instruction set. The function of the CPU is to run system code and complete system control and data operations.
  • FIG. 2 shows a schematic diagram of a CPU architecture in an embodiment of the present application.
  • the CPU adopts a two-stage pipeline architecture.
  • the first stage is the instruction fetch stage, which is responsible for fetching instructions from the instruction storage space into the instruction cache.
  • the second stage decodes and executes the instruction.
  • decoding analyzes the type of the instruction (vector instruction or ordinary instruction) and starts the corresponding data operation according to the instruction type and opcode.
  • for example, a vector add instruction reads data from the vector data storage into the vector registers; the operation is then completed in the vector operation unit, and the result is cached in the vector data cache.
  • vector data cache: in AI inference calculation, vector operations are usually not independent, and a computation often has to be completed by several vector operations chained in pipeline form. If the intermediate results were stored in the data SRAM (static random access memory), each vector might take multiple cycles to store or read, which would greatly lengthen the vector calculation. With a vector cache buffer, data can be loaded into it in advance before the vector calculation starts, and the final result is stored back to the data SRAM only after the calculation completes. Both the prefetching of vector data and the storing of results can be done while other operations are running, reducing the vector operation cycles; the sketch below illustrates the effect. The port of the vector data cache module is wide, to meet the bandwidth requirements of the vector operation unit.
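  • A back-of-the-envelope sketch of that argument, with invented cycle costs (the patent quantifies none of this): intermediate results that stay in the vector cache avoid the SRAM round trip that every chained operation would otherwise pay.

```python
# Hypothetical cycle counts per vector transfer; only the ratio matters here.
SRAM_ACCESS, CACHE_ACCESS = 4, 1

def pipeline_cycles(num_chained_ops, use_vector_cache):
    if use_vector_cache:
        # one SRAM load up front, chained ops through the cache, one SRAM store
        return SRAM_ACCESS + num_chained_ops * CACHE_ACCESS + SRAM_ACCESS
    # every intermediate result is written to and read back from SRAM
    return num_chained_ops * 2 * SRAM_ACCESS

for ops in (2, 4, 8):
    print(ops, "ops:", pipeline_cycles(ops, True), "vs", pipeline_cycles(ops, False))
```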
  • the instruction is obtained from the RISC-V instruction space and cached in the cache, and it is judged whether the instruction is a branch jump instruction; in response to the instruction being a branch jump instruction, the instruction address is regenerated, and a jump is made to the corresponding branch according to the instruction address.
  • when the branch jump is taken (or the jump is unconditional), the pc (instruction address) is regenerated.
  • the architecture has three architecture branches, namely: the general architecture branch, which supports general-purpose instructions and realizes the functions of a CPU; the vector architecture branch, which supports the RISC-V vector instruction set and completes vector operations; and the AIPU branch, which supports general load/store instructions and custom user instructions, and is used to complete specialized intensive calculations such as convolution and matrix multiplication.
  • the AIPU branch can establish a connection with the AIPU architecture.
  • the AIPU branch configures the registers of each functional module through the CPU's load/store instructions.
  • the work of each functional module in the AIPU is controlled only by the registers and does not require the participation of CPU instructions; the calculation efficiency is therefore high, but flexibility is limited, which suits specialized large-scale computing. A sketch of this register-driven configuration follows below.
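  • The following sketch illustrates configuring AIPU functional modules through ordinary load/store instructions, as the text describes. The register offsets, field meanings, and the dictionary standing in for a memory-mapped register file are all invented for illustration.

```python
# Stand-in for a memory-mapped AIPU register file; addresses are hypothetical.
AIPU_REGS = {}

def store(addr, value):        # models a RISC-V store to a mapped register
    AIPU_REGS[addr] = value

def load(addr):                # models a RISC-V load from a mapped register
    return AIPU_REGS.get(addr, 0)

DIM_REG, STRIDE_REG, START_REG = 0x00, 0x04, 0x08   # invented offsets

store(DIM_REG, (64, 64, 32))   # configure feature-map dimensions
store(STRIDE_REG, 2)           # configure the convolution stride
store(START_REG, 1)            # from here the module runs without CPU instructions
print(load(DIM_REG), load(STRIDE_REG), load(START_REG))
```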
  • the vector architecture branch is controlled by the CPU's vector instructions, and each step of the operation requires instruction control. The vector architecture branch is thus more flexible than the AIPU but less computationally efficient, and it is suitable for small-batch, diversified vector calculations. Since vector operations involve a lot of data, speeding up data loads and stores is the key.
  • the feature data and coefficient data for the current convolution operation are stored through the first-level input feature cache and the first-level coefficient cache, and the feature data and coefficient data for the next convolution operation are stored through the second-level input feature cache and the second-level coefficient cache.
  • the input feature vector buffer and coefficient vector buffer are mainly used to buffer the data to be calculated in the current clock cycle of the multiply-add operation unit, and these data are all calculated in parallel in the form of vectors.
  • FIG. 3 shows a schematic diagram of the AIPU architecture provided by this application.
  • the AIPU architecture includes register files, DMA, read and write interface arbitration, address generators, convolution timing controllers, vector caches, multiply-add operation matrices, intermediate result accumulators, and special vector operation units.
  • the core of the AIPU architecture is the multiply-add matrix module, which contains a large number of multiply-add hardware resources and can perform parallel, high-speed multiply-add operations to meet the computing-power requirements of intensive convolution/matrix operations; the other modules exist to make the convolution operation more efficient.
  • data multiplexing is introduced to solve the problem that the data demand during calculation is large while the bandwidth of the data bus and SRAM is insufficient.
  • the read data is reused as much as possible to reduce the pressure on the bandwidth; the buffer (cache) is set to match the data throughput of the modules before and after it, reducing blocking so that each functional module can run at full speed;
  • the vector operation unit can provide different algorithm support according to the requirements of the convolution algorithm, so that data can be read once, used to complete the operation, and then stored, instead of being read multiple times to complete a full convolution calculation;
  • the address generator cooperates with the read/write control to read and write data in different orders, and this ordering of the data makes the convolution operation more efficient. The convolutional neural network used in AI computing is usually divided into many layers, and the AI inference chip computes layer by layer, with each layer containing a large number of convolution or matrix operations. Once the ping-pong registers are established, the parameters required for the next layer's AIPU calculation, such as data dimensions, can be configured while the current layer is being calculated; in this way, the calculation of the next layer can start immediately after this layer ends, reducing the computing time of the entire neural network and improving computing efficiency.
  • FIG. 4 shows a schematic diagram of a convolution operation in an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application.
  • the f0 vector block is multiplied and added with w0...w7 simultaneously (a vector block contains multiple vector elements; the multiply-add operation multiplies the corresponding vector elements and accumulates all of the products, and the accumulated sum is the output result).
  • when f0 and w0...w7 are sent to the multiply-add matrix, f0 is effectively copied 8 times and multiplied-and-added with the w0...w7 vector blocks respectively.
  • likewise, f1...f7 all need to be multiplied and added with w0...w7.
  • f0...f7 multiplex the w0...w7 vector blocks, and each w vector block multiplexes the same f vector block. Therefore, across these 8 matrix operations, w0...w7 need to be fetched only once, and one f vector block is read for each calculation. The 8 operations take 8 clock cycles, and reading w0...w7 also takes 8 cycles, so the reading of the w vector blocks can be hidden in the calculation process (that is, data reading overlaps completely with calculation, and the calculation never has to pause to wait for data). This is why the input feature vector buffer and coefficient vector buffer need to be set up; the sketch below spells out the reuse pattern.
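  • A small sketch of that reuse pattern, with arbitrary block size and values (the real datapath width is not specified at this level of the text): w0...w7 are fetched once and reused across all 8 cycles, while one f block arrives per cycle.

```python
def mac(f_block, w_block):
    # multiply corresponding vector elements and accumulate the products
    return sum(a * b for a, b in zip(f_block, w_block))

f_blocks = [[float(i + j) for j in range(4)] for i in range(8)]   # f0..f7
w_blocks = [[0.1 * (i + 1)] * 4 for i in range(8)]                # w0..w7

results = []
for f in f_blocks:                                  # one f block per clock cycle
    results.append([mac(f, w) for w in w_blocks])   # f reused against all 8 w blocks

print(len(results), "cycles,", len(results[0]), "multiply-add results per cycle")
```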
  • the intermediate result cache is used to cache the intermediate results of vector calculations. According to the convolution principle, a single vector multiply-add operation cannot produce the final result; the results of multiple multiply-adds need to be accumulated. Therefore, a cache is placed after the multiply-accumulate result accumulator. When the intermediate results have been accumulated into the complete final result, the complete result is stored in the complete-result cache buffer.
  • This cache buffer has multiple functions:
  • the cache buffer is shared by the subsequent activation modules, pooling modules, etc., and is used to store the input data and output data of these computing modules;
  • the module has a bus read and write control to send the final calculation data to the DMA interface.
  • the method further includes: configuring the register file in the AIPU branch into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation.
  • the register file can be set up as system registers when the compiler backend is added, and the configuration information is loaded by the load instruction.
  • the register file is configured into two parts that perform a ping-pong operation: while the first part controls the current AIPU operation, the second part accepts the parameters required for the AIPU's next calculation, and when the first part's operation completes, the second part becomes the currently active register set. This ensures continuous, uninterrupted operation of the AIPU.
  • the principle of register file configuration and switching is as follows: since two sets of registers are added when the chip architecture is described in the compiler backend, the compiler finds the corresponding registers according to the architecture's register description. For example, load r0, addr loads the data at the address into register set 0, and load r1, addr loads it into register set 1. When the AIPU uses the registers, however, it needs to determine which set is available, so a "calculation complete" signal alternately enables register set 0 and register set 1. During programming, after one AIPU calculation is enabled, the other register set must be configured immediately to prepare for the next AIPU startup; a sketch follows below.
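  • A minimal sketch of that ping-pong scheme, assuming a two-bank register file and a software-visible "calculation complete" event; the class name and parameter fields are invented.

```python
class PingPongRegs:
    def __init__(self):
        self.banks = [{}, {}]
        self.active = 0                       # bank currently driving the AIPU

    def configure_next(self, **params):       # CPU load/store writes land here
        self.banks[1 - self.active].update(params)

    def on_calculation_complete(self):        # the signal alternates the banks
        self.active = 1 - self.active
        return self.banks[self.active]        # parameters for the next layer

regs = PingPongRegs()
regs.configure_next(dims=(64, 64, 32), stride=1)  # set up during the current layer
print(regs.on_calculation_complete())             # next layer can start immediately
```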
  • the method further includes: in response to the instruction being a load or store instruction, reading data from the address in the storage space into the destination operand according to the address in the source operand.
  • RISC-V instructions usually have one or two source operands rs1 and rs2, and the corresponding vector source operands are vs1 and vs2.
  • the instruction sends the source operand to the corresponding execution unit (including data load, data storage, scalar calculation, vector calculation, etc.) according to the opcode (representing the type of calculation, such as addition, subtraction, multiplication, division, etc.).
  • when the opcode represents load/store, the instruction is a memory-access instruction, and the execution unit reads from the data storage space into the destination operand (rd or vd) according to the address in rs1.
  • the method further includes: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write at the same time.
  • the software divides the 32 vector registers into 8 groups, with two software vector groups corresponding to one hardware vector group. If the vector registers vs1 and vs2 are in the same group during calculation, only one port is enabled for reading and writing; if they are in two different groups, two ports are enabled to read and write at the same time, as the sketch below shows.
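  • A sketch of the port-selection rule as read from this paragraph (32 registers, 8 groups of 4; the group arithmetic is an assumption drawn from those numbers):

```python
GROUP_SIZE = 32 // 8              # 4 vector registers per software group

def group_of(vreg):
    return vreg // GROUP_SIZE

def ports_needed(vs1, vs2):
    # same group: one port serves both reads; different groups: two ports in parallel
    return 1 if group_of(vs1) == group_of(vs2) else 2

print(ports_needed(0, 3))         # same group       -> 1 port
print(ports_needed(0, 5))         # different groups -> 2 ports read simultaneously
```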
  • the method further includes performing a vector operation according to the instruction in response to the jump to the vector architecture branch.
  • the method further includes: according to the requirements of the operation, reading the data corresponding to the operation, performing dimension transformation on it, and writing the converted data into the corresponding coefficient cache or input feature cache.
  • both the coefficient cache unit and the input feature cache unit need to read the weight and feature data from the shared external SRAM, and the address generator generates the corresponding SRAM address according to the register configuration.
  • convolution calculations and matrix operations are computed differently depending on the application. For example, convolution is divided into one-dimensional/two-dimensional/three-dimensional convolution, hole (dilated) convolution, depthwise convolution, separable convolution, transposed convolution, and so on. Different calculation methods read data in different ways.
  • convolution calculation usually also requires a corresponding dimension transformation of the data, which the address generator accomplishes implicitly by reading the data in different orders according to the register configuration. The functions of the address generator and the read/write data control are therefore: according to the calculation requirements, read the data, perform the corresponding dimension conversion, and write the result into the corresponding coefficient (weight) cache unit or input feature cache unit, as sketched below.
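  • The following sketch shows the idea of such an address generator for an ordinary or dilated 2D convolution window. The row-major layout, unit element size, and parameter names are assumptions; the point is only that register-configured dimensions, stride, and dilation determine the order in which SRAM addresses are emitted.

```python
def address_stream(base, height, width, stride=1, dilation=1, kernel=(3, 3)):
    """Yield SRAM addresses window by window, in the order the datapath consumes them."""
    kh, kw = kernel
    for oy in range(0, height - dilation * (kh - 1), stride):
        for ox in range(0, width - dilation * (kw - 1), stride):
            for ky in range(kh):
                for kx in range(kw):
                    yield base + (oy + ky * dilation) * width + (ox + kx * dilation)

addrs = list(address_stream(base=0x1000, height=5, width=5, stride=2))
print([hex(a) for a in addrs[:9]])   # the addresses of the first 3x3 window
```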
  • the convolution timing control unit is the control core of the entire AIPU, which is responsible for collecting the status of each functional module, controlling the enabling of related modules, and generating the synchronization signal of the convolution operation.
  • the convolution sync signal is the beat of the entire convolution process.
  • the size of M (the period of the synchronization signal, in cycles) is determined by the number of times data is multiplexed in the convolution process.
  • the data-loading synchronization signal is the convolution-calculation synchronization signal delayed by a fixed number of data read/write cycles.
  • the accumulator synchronization signal is the convolution synchronization signal delayed by a fixed number of multiply-add operation cycles.
  • the method further includes: in response to performing the convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling writing to the caches.
  • the remaining available storage space of the input feature cache and coefficient cache is determined by the two processes of reading and writing data. Data writing reduces the remaining available space, and data reading increases the remaining available space.
  • the convolution timing controller calculates the available space size of the cache according to the number of times of reading and writing the cache.
  • when the data in the two caches is enough to start the convolution operation (for example, the coefficient data meets the multiplexing-count requirement, the input feature data meets the requirement of multiple computations, and the calculation time is greater than or equal to the data-loading time required for the next calculation), convolution is enabled.
  • the input feature buffer and coefficient buffer are continuously read, so that the remaining space of the two buffers gradually increases.
  • when the remaining space exceeds the size of the next group of data, writing to the cache is enabled. Therefore, if the load time of the next set of data is less than the convolution calculation time of the previous set, the convolution calculation runs uninterrupted; if calculation is fast and data loading is slow, the convolution process will stall intermittently. The sketch below models this flow control.
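  • A toy model of that flow control, with invented capacities and group sizes; it only demonstrates the write-enable condition (remaining space greater than the next group of data) that the embodiment describes.

```python
class Level1Cache:
    def __init__(self, capacity):
        self.capacity, self.used = capacity, 0

    def remaining(self):
        return self.capacity - self.used

    def write_enabled(self, next_group_size):
        # the condition from the embodiment: only write when the next group fits
        return self.remaining() > next_group_size

    def write(self, n):                 # data loading fills the cache
        assert self.write_enabled(n)
        self.used += n

    def read(self, n):                  # the convolution consumes data
        self.used -= n

feat = Level1Cache(capacity=1024)
feat.write(512)                         # current group loaded
feat.read(256)                          # convolution is consuming data
print(feat.remaining(), feat.write_enabled(next_group_size=512))  # 768 True
```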
  • the convolution operation is performed according to the corresponding feature data and coefficient data, and the results obtained by the operation are activated, normalized and pooled.
  • the data needs to be activated (e.g., ReLU), normalized and pooled after the multiply-add calculation is completed.
  • if a computationally slower vector operation unit outside the AIPU were used, a large number of intermediate multiply-add results would accumulate before the vector operation, waiting for activation or pooling, and the efficiency of the entire convolution operation would be dragged down by that vector unit. Therefore, the vector operations required by convolution, such as activation, are specialized and placed directly after the multiply-add matrix unit.
  • the dedicated vector calculation unit can be connected in series with the multiply-accumulate unit and the accumulation unit, or can work independently, and the intermediate result cache unit is shared by these dedicated vector calculation units.
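  • A last sketch of why the dedicated vector unit sits directly after the multiply-add matrix: each accumulated result can be activated (and pooled) as it emerges, rather than piling up intermediate results for a slower external vector unit. The operators (ReLU, a 2-wide max pool) are illustrative choices, not the patent's fixed set.

```python
def relu(x):
    return x if x > 0 else 0.0

mac_results = [0.7, -1.2, 3.4, -0.5, 2.2, 0.9]   # stream from the accumulator

activated = [relu(x) for x in mac_results]        # applied in line, no round trip
pooled = [max(activated[i], activated[i + 1])     # simple 2-wide max pooling
          for i in range(0, len(activated), 2)]
print(pooled)                                     # [0.7, 3.4, 2.2]
```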
  • a processor architecture with three instruction branches is designed, namely: the general instruction branch, the vector instruction branch, and the AIPU branch;
  • the AIPU architecture is designed.
  • the AIPU is combined with the RISC-V architecture in the form of an accelerator. It has a dedicated register file and is configured by the RISC-V load/store instruction to accelerate convolution and matrix operations;
  • the architecture of the AIPU multiply-add operation array is designed, which is a two-dimensional parallel multiply-add operation unit.
  • the front-stage double buffer (whose purpose is to keep the subsequent units supplied with continuous data) is composed of the input feature buffer, the coefficient buffer and the convolution control unit. Using real-time monitoring of the remaining space, data is continuously read out of the buffer while the data required for the next step is written into it.
  • the back-end buffer is to realize the functions of increasing bandwidth and data multiplexing.
  • a flexible address generator is designed: according to the configuration of the register, the address generator is matched with the buffer of the latter stage to complete the transformation of the data dimension while reading the data;
  • the ping-pong operation registers are designed to ensure uninterrupted operation across two consecutive convolution operations.
  • the architecture in the embodiments of the present application is applied very flexibly, providing both the control capability of a general-purpose CPU and the computing power required by AI. It can be applied to edge machines for artificial intelligence and the IoT, and it can also achieve greater computing power through a network-on-chip (NoC) and be installed in a PC or server in the form of an accelerator card to realize cloud-side inference or training.
  • a system for data processing based on the RISC-V instruction set, including: an acquisition module, configured to acquire an instruction from the RISC-V instruction space, cache it in the cache, and determine the type of the instruction; a jump module, configured to regenerate the instruction address in response to the instruction being a branch jump instruction, and to jump to the corresponding branch according to the instruction address; an AIPU module, configured to, in response to a jump to the AIPU branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and an execution module, configured to perform a convolution operation according to the corresponding feature data and coefficient data, and activate, normalize and pool the results of the operation.
  • the system further includes a vector module, configured to perform vector operations according to the instruction in response to a jump to the vector architecture branch.
  • the system further includes a first judgment module configured to: in response to the instruction being a load or store instruction, read data from the address in the storage space into the destination operand according to the address in the source operand.
  • the system further includes a second judgment module configured to: judge whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enable two ports with the same bit width as the vector registers to read and write at the same time.
  • the system further includes a configuration module, configured to configure the register file in the AIPU branch into two parts, where the first part runs the current AIPU operation and the second part obtains the parameters required by the AIPU for the next operation.
  • the system further includes a conversion module, configured to read the data corresponding to the operation and perform dimension conversion on it according to the requirements of the operation, and to write the converted data into the corresponding coefficient cache or input feature cache.
  • the system further includes a computing module, configured to read the data in the first-level input feature cache and the first-level coefficient cache in response to performing the convolution calculation, and to judge whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, to enable writing to the caches.
  • a computer device, including: at least one processor; and a memory, where the memory stores computer instructions that can be executed on the processor, and the instructions are executed by the processor to implement the following steps: S1, obtain the instruction from the RISC-V instruction space, cache it in the cache, and determine the type of the instruction; S2, in response to the instruction being a branch jump instruction, regenerate the instruction address and jump to the corresponding branch according to the instruction address; S3, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and S4, perform a convolution operation according to the corresponding feature data and coefficient data, and activate, normalize and pool the result of the operation.
  • the steps further comprise: in response to jumping to the vector architecture branch, performing a vector operation according to the instruction.
  • the steps further include: in response to the instruction being a load or store instruction, reading data from the address in the storage space into the destination operand according to the address in the source operand.
  • the steps further include: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to read and write at the same time.
  • the step further includes: configuring the register file in the AIPU branch into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation.
  • the steps further include: according to the requirements of the operation, reading the data corresponding to the operation, performing dimension transformation on it, and writing the converted data into the corresponding coefficient cache or input feature cache.
  • the steps further include: in response to performing the convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling writing to the caches.
  • FIG. 5 is a schematic diagram of the hardware structure of an embodiment of the above computer device for data processing based on the RISC-V instruction set provided by the present application.
  • the device includes a processor 201 and a memory 202 , and may also include an input device 203 and an output device 204 .
  • the processor 201 , the memory 202 , the input device 203 and the output device 204 may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 5 .
  • the memory 202 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the data processing method.
  • the processor 201 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 202, that is, it implements the method for data processing based on the RISC-V instruction set of the above method embodiments.
  • the memory 202 may include a storage program area and a storage data area, where the storage program area may store an operating system and an application program required by at least one function, and the storage data area may store data created by the use of the method for data processing based on the RISC-V instruction set, etc. Additionally, the memory 202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 202 may optionally include memory located remotely from the processor 201, which may be connected to local modules via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the input device 203 can receive input information such as user name and password.
  • the output device 204 may include a display device such as a display screen.
  • one or more program instructions/modules corresponding to the method for data processing based on the RISC-V instruction set are stored in the memory 202, and when executed by the processor 201, they perform the method for data processing based on the RISC-V instruction set in any of the above method embodiments.
  • Any embodiment of a computer device that executes the above-mentioned method for data processing based on a RISC-V instruction set can achieve the same or similar effects as any of the foregoing method embodiments corresponding to it.
  • the present application also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program that executes the above method when executed by a processor.
  • FIG. 6 is a schematic diagram of an embodiment of the above computer storage medium for data processing based on the RISC-V instruction set provided by the present application.
  • the computer readable storage medium 3 stores a computer program 31 that executes the above method when executed by the processor.
  • the program of the method for data processing based on the RISC-V instruction set can be stored in a computer-readable storage medium, and when the program is executed, it may include the processes of the above method embodiments.
  • the storage medium of the program may be a magnetic disk, an optical disk, a read only memory (ROM) or a random access memory (RAM) or the like.
  • the above computer program embodiments can achieve the same or similar effects as any of the foregoing method embodiments corresponding thereto.
  • the storage medium can be a read-only memory, a magnetic disk or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Disclosed in the present application are a data processing method and system based on a RISC-V instruction set, and a device and a storage medium. The method comprises: acquiring an instruction from a RISC-V instruction space, caching the instruction in a cache, and determining the type of the instruction; in response to the instruction being a branch jump instruction, regenerating an instruction address, and jumping to a corresponding branch according to the instruction address; in response to jumping to an AIPU branch, storing, by means of a first-stage input feature cache and a first-stage coefficient cache, the feature data and coefficient data used for the current convolution operation, and storing, by means of a second-stage input feature cache and a second-stage coefficient cache, the feature data and coefficient data used for the next convolution operation; and performing a convolution operation according to the corresponding feature data and coefficient data, and performing activation, normalization and pooling on the result of the operation. By means of the present application, a processor architecture with three instruction branches is designed according to the RISC-V instruction set, realizing general control, vector operation, and accelerated convolution and matrix calculation. The present application is suitable for a terminal-side AI inference chip.

Description

Method, system, device and medium for data processing based on RISC-V instruction set
This application claims priority to the Chinese patent application No. 202110175746.6, filed on February 9, 2021 and entitled "Method, System, Device and Medium for Data Processing Based on RISC-V Instruction Set", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data processing, and more particularly to a method, system, computer device and readable medium for data processing based on the RISC-V instruction set.
Background
The value of data lies in analysis and utilization, not simple storage. The amount of data is constantly growing, and it is impossible to transmit all of it to the cloud through the network, since bandwidth grows more slowly than data. For application scenarios with strict real-time requirements, such as autonomous driving and unmanned driving, data needs to be judged at the edge. Scenarios with high privacy protection requirements, such as medical information or data that users are unwilling to share in the cloud, require local storage. For example, most of the data generated by security equipment is useless or has no potential to be tapped, and transmitting all of it to the cloud wastes bandwidth; if intelligent analysis is performed at the edge and only useful or potentially useful data is transmitted to the cloud, network bandwidth is greatly saved. The transfer of data processing from the cloud to the edge is therefore an inevitable trend, and edge-side AI (artificial intelligence) chips are the general trend as well.
Artificial intelligence processing at the edge requires AI chips, whose main challenges are computing power and computing efficiency. The computing power of an AI chip is determined by the number of on-chip computing units. Since the amount of data involved in AI computing is very large, in theory the greater the computing power of an AI chip the better, but in practice it is restricted by various factors:
1. On-chip storage bandwidth and bus bandwidth: the main contradiction in AI chips is between storage bandwidth and computing power. The greater the computing power, the greater the amount of input data, intermediate results and output data, and the higher the required storage bandwidth. Current storage bandwidth falls far short of computing-power requirements, and if the computing units and storage units cannot be arranged reasonably, the result is a chip with large computing power but low efficiency.
2. The operators involved in AI computation are varied, including convolution, matrix computation, normalization, activation, pooling and other linear and nonlinear calculations. A deep neural network model usually consists of multiple layers, where the output of the previous layer is the input of the next layer; within the same layer, the result of the multiply-add operation is often the input of activation, pooling and normalization. Therefore, if multi-threading, parallel computing and computation pipelining cannot be implemented reasonably, the calculation of the previous step will hinder the calculation of the next step, wasting resources and reducing computing efficiency.
3. As mentioned in point 2, the operators involved in AI are varied, but the AI chip is fixed. Making unchanging hardware handle variable operators efficiently requires software that can reasonably allocate hardware resources according to the hardware architecture and compile efficient machine code; at the same time, the AI chip is required to have efficient control capabilities.
SUMMARY OF THE INVENTION
有鉴于此,本申请实施例的目的在于提出一种基于RISC-V指令集进行数据处理的方法、系统、计算机设备及计算机可读存储介质,通过AIPU(AI process unit,人工智能处理单元)和CPU共享内存,使计算与存储相邻,提高内存访问带宽,方便AIPU与CPU的数据交互,减少了与外部总线的数据交互量,减少了对总线带宽的需求。同时AIPU和CPU内部各有一个小的buffer(缓存)用于缓存输入数据、中间结果、输出数据以及CPU的预读取的指令,允许在数据计算的同时进行数据加载,延长数据读写时间,进一步减少对总线带宽的需求。In view of this, the purpose of the embodiments of the present application is to propose a method, system, computer equipment and computer-readable storage medium for data processing based on RISC-V instruction set, through AIPU (AI process unit, artificial intelligence processing unit) and CPU shares memory, making computing and storage adjacent, improving memory access bandwidth, facilitating data interaction between AIPU and CPU, reducing the amount of data interaction with external buses, and reducing the demand for bus bandwidth. At the same time, the AIPU and the CPU each have a small buffer (cache) used to cache input data, intermediate results, output data and CPU pre-reading instructions, allowing data to be loaded at the same time as data calculation and prolonging data read and write time. Further reducing the need for bus bandwidth.
基于上述目的,本申请实施例的一方面提供了一种基于RISC-V指令集进行数据处理的方法,包括如下步骤:从RISC-V指令空间中获取指令缓存到缓存中,并判断所述指令的类型;响应于所述指令为分支跳转指令,重新生成指令地址,并根据所述指令地址跳转到对应的分支;响应于跳转到AIPU分支,通过第一级输入特征缓存和第一级系数缓存存储用于当前卷积运算的特征数据和系数数据,并通过第二级输入特征缓存和第二级系数缓 存存储下一步卷积运算的特征数据和系数数据;以及根据对应的特征数据和系数数据进行卷积运算,并对运算得到的结果进行激活、归一化和池化。Based on the above purpose, an aspect of the embodiments of the present application provides a method for data processing based on a RISC-V instruction set, including the following steps: acquiring an instruction from the RISC-V instruction space and caching it in the cache, and judging the instruction type; in response to the instruction being a branch jump instruction, regenerate the instruction address, and jump to the corresponding branch according to the instruction address; in response to jumping to the AIPU branch, through the first level input feature cache and the first The first-level coefficient cache stores the feature data and coefficient data for the current convolution operation, and stores the feature data and coefficient data of the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and according to the corresponding feature data Perform convolution operation with coefficient data, and activate, normalize and pool the result obtained by operation.
In some embodiments, the method further comprises: in response to jumping to the vector architecture branch, performing a vector operation according to the instruction.
In some embodiments, the method further comprises: in response to the instruction being a load or store instruction, reading the address in the storage space into the destination operand according to the address in the source operand.
In some embodiments, the method further comprises: determining whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports of the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the method further comprises: configuring the register file in the AIPU branch into two parts, the first part running the current AIPU operation while the second part obtains the parameters required for the AIPU's next operation.
In some embodiments, the method further comprises: according to the requirements of an operation, reading the data corresponding to the operation, performing a dimension transformation on that data, and writing the transformed data into the corresponding coefficient cache or input feature cache.
In some embodiments, the method further comprises: in response to performing a convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and determining whether the remaining space in the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space being larger than the size of the next group of data, enabling the write cache.
Another aspect of the embodiments of the present application provides a system for data processing based on the RISC-V instruction set, comprising: an acquisition module, configured to fetch an instruction from the RISC-V instruction space into a cache and determine the type of the instruction; a jump module, configured to regenerate the instruction address in response to the instruction being a branch jump instruction, and to jump to the corresponding branch according to the instruction address; an AIPU module, configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation in the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in the second-level input feature cache and the second-level coefficient cache; and an execution module, configured to perform the convolution operation according to the corresponding feature data and coefficient data, and to apply activation, normalization, and pooling to the result.
In yet another aspect of the embodiments of the present application, a computer device is provided, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the steps of the above method.
In a further aspect of the embodiments of the present application, a computer-readable storage medium is provided, storing a computer program that implements the above method steps when executed by a processor.
The present application has the following beneficial technical effects: the AIPU (AI processing unit) shares memory with the CPU, and the AIPU establishes the first-level input feature cache, the first-level coefficient cache, the second-level input feature cache, and the second-level coefficient cache in the shared memory. This places computation adjacent to storage, increases memory access bandwidth, facilitates data exchange between the AIPU and the CPU, and reduces both the volume of data exchanged over the external bus and the demand for bus bandwidth. In addition, the AIPU and the CPU each contain a small buffer for caching input data, intermediate results, output data, and the CPU's prefetched instructions, which allows data to be loaded while computation proceeds, extends the time available for data reads and writes, and further reduces the demand for bus bandwidth.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present application; those of ordinary skill in the art can derive other embodiments from these drawings without creative effort.
FIG. 1 is a schematic diagram of an embodiment of a method for data processing based on the RISC-V instruction set provided by the present application;
FIG. 2 is a schematic diagram of the CPU architecture in an embodiment of the present application;
FIG. 3 is a schematic diagram of the AIPU architecture provided by the present application;
FIG. 4 is a schematic diagram of a convolution operation in an embodiment of the method for data processing based on the RISC-V instruction set provided by the present application;
FIG. 5 is a schematic diagram of the hardware structure of an embodiment of a computer device for data processing based on the RISC-V instruction set provided by the present application;
FIG. 6 is a schematic diagram of an embodiment of a computer storage medium for data processing based on the RISC-V instruction set provided by the present application.
DETAILED DESCRIPTION
To make the objects, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present application serve to distinguish two non-identical entities or parameters sharing the same name. "First" and "second" are used merely for convenience of expression and should not be construed as limiting the embodiments of the present application; subsequent embodiments will not repeat this explanation.
Based on the above object, a first aspect of the embodiments of the present application proposes an embodiment of a method for data processing based on the RISC-V instruction set. FIG. 1 is a schematic diagram of an embodiment of the method for data processing based on the RISC-V instruction set provided by the present application. As shown in FIG. 1, the embodiment of the present application comprises the following steps:
S1. Fetch an instruction from the RISC-V instruction space into a cache, and determine the type of the instruction;
S2. In response to the instruction being a branch jump instruction, regenerate the instruction address, and jump to the corresponding branch according to the instruction address;
S3. In response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation in the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in the second-level input feature cache and the second-level coefficient cache; and
S4. Perform the convolution operation according to the corresponding feature data and coefficient data, and apply activation, normalization, and pooling to the result.
The embodiments of the present application adopt a structure that integrates storage and computation: the AIPU and the CPU share memory, and the AIPU establishes the first-level input feature cache, the first-level coefficient cache, the second-level input feature cache, and the second-level coefficient cache in the shared memory. This places computation adjacent to storage, increases memory access bandwidth, facilitates data exchange between the AIPU and the CPU, and reduces both the volume of data exchanged over the external bus and the demand for bus bandwidth. In addition, the AIPU and the CPU each contain a small buffer for caching input data, intermediate results, output data, and the CPU's prefetched instructions, which allows data to be loaded while computation proceeds, extends the time available for data reads and writes, and further reduces the demand for bus bandwidth.
The RISC-V instruction set comprises a general-purpose instruction set and a vector extension instruction set, and can be divided into: the integer instruction set I, the multiply-add operation instruction set M, the atomic operation instruction set A, the single-precision instruction set F, the double-precision instruction set D, the compressed instruction set C, and the vector instruction set V. The arithmetic logic unit executes the IMAFDC instruction set operations, and the vector operation unit executes the vector instruction set V operations. The CPU architecture is designed according to the RISC-V instruction set; the CPU's function is to run system code and perform system control and data operations.
FIG. 2 is a schematic diagram of the CPU architecture in an embodiment of the present application. As shown in FIG. 2, the CPU adopts a two-stage pipeline. The first stage is instruction fetch, which fetches instructions from the instruction storage space into the instruction cache. The second stage decodes and executes the instruction. During decoding, the type of the instruction (vector instruction or ordinary instruction) is analyzed, and the corresponding data operation is started according to the instruction type and opcode. For example, a vector add instruction reads data from the vector data storage into the vector registers, completes the operation in the vector operation unit, and caches the result in the vector data cache.
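As a rough illustration of this two-stage flow, the C sketch below models a fetch stage and a decode/execute stage that dispatches on instruction type. All names (`insn_type`, `fetch`, `decode_execute`) and the three-way classification are illustrative assumptions, not the actual RISC-V encoding:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical instruction categories; the real decoder inspects
   RISC-V opcode fields, which are simplified away here. */
typedef enum { INSN_SCALAR, INSN_VECTOR, INSN_BRANCH } insn_type;

typedef struct { insn_type type; unsigned raw; } insn;

/* Stage 1: fetch the instruction at pc into the pipeline. */
static insn fetch(const insn *imem, uint32_t pc) { return imem[pc]; }

/* Stage 2: decode and execute; vector instructions go to the vector
   unit, everything else to the scalar ALU or branch logic. */
static void decode_execute(insn i) {
    switch (i.type) {
    case INSN_VECTOR: printf("vector unit executes 0x%x\n", i.raw); break;
    case INSN_BRANCH: printf("branch unit resolves 0x%x\n", i.raw); break;
    default:          printf("scalar ALU executes 0x%x\n", i.raw); break;
    }
}

int main(void) {
    insn imem[3] = {
        {INSN_SCALAR, 0x1}, {INSN_VECTOR, 0x2}, {INSN_BRANCH, 0x3}
    };
    for (uint32_t pc = 0; pc < 3; pc++)
        decode_execute(fetch(imem, pc));   /* two stages per instruction */
    return 0;
}
```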
The purpose of the vector data cache is as follows. In AI inference, vector operations are usually not independent; multiple vector operations often need to be pipelined sensibly. If intermediate results were stored in the data SRAM (static random access memory), storing or reading the vector data could take multiple cycles, which would greatly lengthen the vector computation. With a vector cache buffer, data can be loaded into the buffer before the vector computation starts, and the final result is stored back to the data SRAM once the computation finishes. Both the prefetch of vector data and the storage of results can be completed while other operations are in progress, reducing the number of vector operation cycles. The port of the vector data cache module is wide enough to meet the bandwidth requirements of the vector operation unit.
An instruction is fetched from the RISC-V instruction space into the cache, and whether it is a branch jump instruction is determined; in response to the instruction being a branch jump instruction, the instruction address is regenerated, and execution jumps to the corresponding branch according to the instruction address. When a branch jump instruction is encountered and the branch is taken (or the jump is unconditional), the pc (instruction address) is regenerated and the instructions in the instruction cache are cleared.
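A minimal sketch of that flush behavior, assuming a hypothetical prefetch queue `iqueue` standing in for the instruction cache:

```c
#include <stdint.h>

#define IQ_DEPTH 8

/* Hypothetical prefetch queue; the patent's instruction cache is
   modeled here as a simple buffer of raw instructions. */
typedef struct {
    uint32_t slots[IQ_DEPTH];
    int      count;
} iqueue;

/* On a taken (or unconditional) branch: regenerate pc from the
   branch target and discard every prefetched instruction. */
static uint32_t take_branch(iqueue *q, uint32_t target) {
    q->count = 0;        /* clear the instruction cache */
    return target;       /* the regenerated pc */
}

int main(void) {
    iqueue q = { .count = 5 };   /* five instructions already prefetched */
    uint32_t pc = take_branch(&q, 0x800);
    return (pc == 0x800 && q.count == 0) ? 0 : 1;
}
```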
The architecture has three branches: the general-purpose architecture branch, which supports the general-purpose instructions and implements the CPU functions; the vector architecture branch, which supports the RISC-V vector instruction set and performs vector operations; and the AIPU branch, which supports the general-purpose load/store instructions as well as custom user instructions, and performs special dense computations such as convolution and matrix multiplication. The AIPU branch connects to the AIPU architecture: it configures the registers of the AIPU's functional modules through the CPU's load/store instructions, and the functional modules inside the AIPU are controlled only by these registers, without participation of CPU instructions. The AIPU is therefore computationally efficient but not very flexible, and suited to special large-scale computation. The vector architecture branch, by contrast, is controlled by the CPU's vector instructions, and every step requires instruction control; it is more flexible than the AIPU but less efficient, and suited to small-batch, diversified vector computation. Since vector operations involve a large amount of data, speeding up data loads and stores is the key.
In response to jumping to the AIPU branch, the feature data and coefficient data for the current convolution operation are stored in the first-level input feature cache and the first-level coefficient cache, and the feature data and coefficient data for the next convolution operation are stored in the second-level input feature cache and the second-level coefficient cache. The input feature vector cache and the coefficient vector cache buffer the data that the multiply-add unit will process in the current clock cycle; these data are computed in parallel in vector form. Since all of these data cannot be read out of the input feature (or coefficient) cache in a single cycle, the reuse characteristics of input feature data and coefficient (weight) data in convolution must be exploited to hide the reading of data behind the computation, so that the computation never stalls.
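The following C sketch illustrates the double-buffering idea in software terms: while one tile is consumed, the next is filled. The tile size, data source, and sequential execution are illustrative assumptions; in the hardware described here, the load and the computation overlap in time:

```c
#include <stdio.h>

#define TILE 64

/* Two-level buffering: the compute side consumes buffers[cur] while
   the load side fills buffers[1 - cur]. */
static float buffers[2][TILE];

static void load_tile(float *dst, int step) {
    for (int i = 0; i < TILE; i++) dst[i] = (float)(step + i);
}

static float compute_tile(const float *src) {
    float s = 0.0f;
    for (int i = 0; i < TILE; i++) s += src[i];
    return s;
}

int main(void) {
    int cur = 0;
    load_tile(buffers[cur], 0);               /* prime the first tile */
    for (int step = 1; step <= 4; step++) {
        load_tile(buffers[1 - cur], step);    /* would overlap with compute in hardware */
        printf("tile %d sum = %f\n", step - 1, compute_tile(buffers[cur]));
        cur = 1 - cur;                        /* swap current and next */
    }
    printf("tile 4 sum = %f\n", compute_tile(buffers[cur]));
    return 0;
}
```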
FIG. 3 is a schematic diagram of the AIPU architecture provided by the present application. As shown in FIG. 3, the AIPU architecture comprises a register file, a DMA, read/write interface arbitration, an address generator, a convolution timing controller, vector caches, a multiply-add operation matrix, an intermediate result accumulator, and special vector operation units.
The core of the AIPU architecture is the multiply-add operation matrix module, which contains a large amount of multiplier and adder hardware and performs parallel, high-speed multiply-add operations to meet the computing power requirements of dense convolution/matrix operations. The other modules exist to make the convolution more efficient. Data reuse, as introduced above, resolves the conflict between the large data demand of the computation and the limited bandwidth of the data bus and the SRAM: data that has been read is reused as much as possible to relieve bandwidth pressure. The buffers smooth out differences in data throughput between the modules before and after them, reducing stalls so that each functional module runs at full speed without blocking. The vector operation units provide different algorithmic support according to the needs of the convolution algorithm, so that data can be read once, used for the required operations, and then stored, rather than being read repeatedly to complete one full convolution. The address generator, together with the read/write control, arranges data by reading and writing in different orders, making the convolution more efficient. The convolutional neural networks used in AI computation are usually divided into many layers, and an AI inference chip computes them layer by layer, each layer containing a large number of convolution or matrix operations. With ping-pong registers, the parameters needed for the AIPU's next layer, such as data dimensions, can be configured while the current layer is being computed; the next layer's computation can then start immediately after the current layer finishes, reducing the computation time of the whole network and improving efficiency.
FIG. 4 is a schematic diagram of the convolution operation in an embodiment of the method for data processing based on the RISC-V instruction set provided by the present application. As shown in FIG. 4, in one pass of the multiply-add operation matrix, the vector block f0 undergoes multiply-add with w0...w7 simultaneously (a vector block contains multiple vector elements; the multiply-add operation multiplies corresponding elements and accumulates all the products, and the accumulated sum is the output). Mapping f0 and w0...w7 onto the multiply-add matrix, f0 is effectively replicated eight times and multiplied-and-added against each of the vector blocks w0...w7. Likewise, f1...f7 each undergo multiply-add with w0...w7. In this process, f0...f7 reuse the w0...w7 vector blocks, and each w vector block reuses the same f vector block. Therefore, across these eight matrix passes, w0...w7 need to be fetched only once, and each pass reads one f vector block. The eight passes take eight clock cycles, and reading w0...w7 also takes eight cycles, so the reading of the w vector blocks can be hidden in the computation (that is, the data reads overlap the computation completely, and the computation never pauses to wait for data). This is why the input feature vector cache and the coefficient vector cache are needed.
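The reuse pattern of FIG. 4 can be spelled out as a plain loop nest. The block count of 8 follows the description above; the vector length and the zero-filled inputs are arbitrary placeholders:

```c
#include <stdio.h>

#define BLOCKS 8   /* f0..f7 and w0..w7, per the description above */
#define VLEN   16  /* elements per vector block; illustrative */

/* One multiply-add of an f block against a w block: multiply
   corresponding elements and accumulate the products. */
static float mac(const float f[VLEN], const float w[VLEN]) {
    float acc = 0.0f;
    for (int i = 0; i < VLEN; i++) acc += f[i] * w[i];
    return acc;
}

int main(void) {
    static float f[BLOCKS][VLEN], w[BLOCKS][VLEN], out[BLOCKS][BLOCKS];
    /* w0..w7 are fetched once; each f block is then streamed past all
       eight w blocks, so the w fetches hide behind eight compute cycles. */
    for (int fi = 0; fi < BLOCKS; fi++)
        for (int wi = 0; wi < BLOCKS; wi++)
            out[fi][wi] = mac(f[fi], w[wi]);
    printf("out[0][0] = %f\n", out[0][0]);
    return 0;
}
```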
The intermediate result cache buffers the intermediate results of vector computations. By the convolution principle, a single vector multiply-add does not yield the final result; the results of multiple multiply-adds must be accumulated. A cache is therefore placed after the multiply-add result accumulator. When continual accumulation of the intermediate results produces a complete final result, the complete result is stored in the complete result cache buffer (a minimal accumulation sketch follows the list below). This cache buffer serves several purposes:
1. It prevents the data from being overwritten by subsequent intermediate results;
2. It is shared by the downstream activation and pooling modules and holds the input and output data of these computing modules;
3. It carries bus read/write control and sends the final computed data to the DMA interface.
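A minimal sketch of that accumulate-then-commit flow, with the pass count and output width chosen arbitrarily:

```c
#include <stdio.h>

#define PARTIALS 4   /* multiply-add passes per output; illustrative */
#define OUTPUTS  8

/* Per the convolution principle above: a single vector multiply-add
   yields only a partial sum, so results are accumulated across passes
   and only the completed value is moved to the complete result buffer. */
int main(void) {
    float acc[OUTPUTS] = {0};       /* intermediate result cache */
    float done[OUTPUTS];            /* complete result cache buffer */
    for (int pass = 0; pass < PARTIALS; pass++)
        for (int o = 0; o < OUTPUTS; o++)
            acc[o] += 1.0f;         /* stand-in for one multiply-add result */
    for (int o = 0; o < OUTPUTS; o++)
        done[o] = acc[o];           /* final value handed to activation/pooling */
    printf("done[0] = %f\n", done[0]);
    return 0;
}
```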
In some embodiments, the method further comprises: configuring the register file in the AIPU branch into two parts, the first part running the current AIPU operation while the second part obtains the parameters required for the AIPU's next operation. The register file can be declared as system registers when the compiler back end is added, and configuration information is loaded with load instructions. The register file is configured into two parts that operate in ping-pong fashion: while the first part controls the current AIPU operation, the second part accepts the parameters needed for the AIPU's next computation; when the first part's operation completes, the second part's registers become the currently active registers. This keeps the AIPU working continuously without interruption.
The register file configuration and switching work as follows. When the chip architecture is described in the compiler back end, the two register banks are added to the back end, and the compiler locates the corresponding registers from the register descriptions in the architecture. For example, `load r0, addr` loads the data at `addr` into register bank 0, and `load r1, addr` loads it into register bank 1. When the AIPU uses the registers, however, it must determine which bank is available, so a "computation complete" signal alternately enables bank 0 and bank 1. When programming, after one AIPU computation is enabled, the other register bank must be configured immediately to prepare for the next AIPU launch.
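A software analogue of the ping-pong switch might look as follows; the `aipu_cfg` fields and the bank-switching helper names are invented for illustration:

```c
#include <stdio.h>

/* Hypothetical AIPU parameter set; the real register file holds
   dimensions, addresses, and similar configuration. */
typedef struct { int rows, cols; } aipu_cfg;

static aipu_cfg bank[2];   /* register bank 0 and bank 1 */
static int active = 0;     /* which bank drives the current run */

/* While bank[active] controls the running layer, software fills the
   other bank for the next layer. */
static void configure_next(aipu_cfg next) { bank[1 - active] = next; }

/* The "computation complete" signal flips the active bank, so the
   next layer can start immediately. */
static void on_compute_done(void) { active = 1 - active; }

int main(void) {
    bank[0] = (aipu_cfg){224, 224};          /* current layer */
    configure_next((aipu_cfg){112, 112});    /* prepare next layer */
    on_compute_done();
    printf("now running %dx%d\n", bank[active].rows, bank[active].cols);
    return 0;
}
```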
In some embodiments, the method further comprises: in response to the instruction being a load or store instruction, reading the address in the storage space into the destination operand according to the address in the source operand. A RISC-V instruction usually has one or two source operands rs1 and rs2; the corresponding vector source operands are vs1 and vs2. According to the opcode (which encodes the operation type, such as add, subtract, multiply, or divide), the instruction routes the source operands to the appropriate execution unit (data load, data store, scalar computation, vector computation, and so on). For example, when the opcode denotes load/store, the instruction is a memory access instruction, and the execution unit reads from the data storage space at the address held in rs1 into the destination operand (rd or vd).
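As a simplified model of the load path, assuming toy register and memory arrays:

```c
#include <stdio.h>
#include <stdint.h>

/* Simplified model of the load path described above: the execution
   unit takes the address held in rs1 and reads data memory into the
   destination register rd. Register and memory sizes are arbitrary. */
int main(void) {
    uint32_t regs[32] = {0};
    uint32_t dmem[256] = {0};
    dmem[0x10] = 42;
    regs[5] = 0x10;                 /* rs1 holds the address */
    int rs1 = 5, rd = 6;
    regs[rd] = dmem[regs[rs1]];     /* load: mem[rs1] -> rd */
    printf("rd = %u\n", regs[rd]);
    return 0;
}
```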
In some embodiments, the method further comprises: determining whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports of the same bit width as the vector registers to read and write simultaneously. To speed up vector loads and stores, multiple ports of the same bit width as the vector registers can be provided. For example, with 32 vector registers, the hardware divides them into 4 groups, each group with its own port. The ports are enabled according to the vector grouping set by the vsetvli instruction: if the instruction `vsetvli t0, a0, e8, m4` groups the vector registers four to a group, the software divides the 32 registers into 8 groups, so two software vector groups map onto one hardware vector group. If the vector registers vs1 and vs2 used in a computation fall in the same group, only one port is enabled for reading and writing; if they fall in two groups, both ports are enabled to read and write simultaneously.
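The port-selection rule can be captured in a few lines; the group sizes follow the 32-register, 4-group example above:

```c
#include <stdio.h>

#define VREGS        32
#define HW_GROUPS    4                      /* one read/write port per group */
#define REGS_PER_HWG (VREGS / HW_GROUPS)    /* 8 registers share a port */

/* Which hardware port serves a given vector register. */
static int hw_group(int vreg) { return vreg / REGS_PER_HWG; }

/* Enable one port when both source operands share a group, two
   ports when they do not, as described above. */
static int ports_needed(int vs1, int vs2) {
    return hw_group(vs1) == hw_group(vs2) ? 1 : 2;
}

int main(void) {
    printf("v2,v5  -> %d port(s)\n", ports_needed(2, 5));    /* same group */
    printf("v2,v20 -> %d port(s)\n", ports_needed(2, 20));   /* two groups */
    return 0;
}
```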
In some embodiments, the method further comprises: in response to jumping to the vector architecture branch, performing a vector operation according to the instruction.
In some embodiments, the method further comprises: according to the requirements of an operation, reading the data corresponding to the operation, performing a dimension transformation on that data, and writing the transformed data into the corresponding coefficient cache or input feature cache. Under the AIPU architecture, both the coefficient cache unit and the input feature cache unit read weight and feature data from the shared external SRAM, and the address generator produces the corresponding SRAM addresses according to the register configuration. Convolution or matrix operations are computed differently in different applications; convolution alone divides into one/two/three-dimensional convolution, dilated convolution, depthwise convolution, separable convolution, transposed convolution, and so on. Different computation styles read data in different ways, and convolution usually also transforms the dimensions of the data. The address generator therefore reads data in different orders according to the register configuration, accomplishing these transformations in passing. In short, the function of the address generator and the read/write data control is to read the data according to the computation's requirements, perform the corresponding dimension transformation, and write the result into the corresponding coefficient (weight) cache unit or input feature cache unit.
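A toy version of such a dimension change during the read, assuming a small row-major matrix in SRAM that is transposed on its way into the cache:

```c
#include <stdio.h>

#define ROWS 3
#define COLS 4

/* Toy address generator: by emitting SRAM addresses column-first
   instead of row-first, the data arrives in the target cache already
   transposed, i.e. the dimension change happens during the read. */
int main(void) {
    int sram[ROWS * COLS], cache[COLS * ROWS];
    for (int i = 0; i < ROWS * COLS; i++) sram[i] = i;
    int out = 0;
    for (int c = 0; c < COLS; c++)            /* generated address order */
        for (int r = 0; r < ROWS; r++)
            cache[out++] = sram[r * COLS + c];
    printf("cache = [%d, %d, %d, ...]\n", cache[0], cache[1], cache[2]);
    return 0;
}
```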
The convolution timing control unit is the control core of the whole AIPU; it collects the status of each functional module, controls the enabling of the related modules, and generates the synchronization signal for the convolution operation. The convolution synchronization signal is the beat of the whole convolution process. The process consists of N (N >= 1) beats, each beat containing M (M >= 1) clock cycles, with one multiply-add and one accumulation completing per clock cycle. A beat therefore contains M multiply-add and accumulation operations. The size of M is determined by the number of times data is reused during the convolution. For example, if the same group of coefficients is reused 8 times, the minimum value of M is 8 (if the number of computation cycles is sufficient to load the next group of data, M equals the computation cycles; otherwise, M must include extra time to load the next group). Since convolution computation and data loading proceed in step, the synchronization signal for the convolution computation is the data-loading synchronization signal delayed by a fixed number of read/write cycles. Likewise, the accumulator's synchronization signal is the convolution synchronization signal delayed by a fixed number of multiply-add cycles.
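Under one reading of the parenthetical above, the beat length M is the larger of the compute cycle count (the reuse count) and the load cycle count; the sketch below encodes that assumption:

```c
#include <stdio.h>

/* One convolution "beat" spans M clock cycles. When computation is
   long enough to hide the next load, M is the compute cycle count;
   otherwise it stretches to cover the load time. */
static int beat_cycles(int reuse_count, int load_cycles) {
    return reuse_count >= load_cycles ? reuse_count : load_cycles;
}

int main(void) {
    printf("reuse 8, load 8  -> M = %d\n", beat_cycles(8, 8));
    printf("reuse 8, load 12 -> M = %d\n", beat_cycles(8, 12));
    return 0;
}
```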
In some embodiments, the method further comprises: in response to performing a convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and determining whether the remaining space in the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space being larger than the size of the next group of data, enabling the write cache. The remaining available space of the input feature cache and the coefficient cache is determined jointly by the reading and writing of data: writes reduce the remaining space, and reads increase it. The convolution timing controller computes the available space from the counts of cache reads and writes. When the data in the two caches is sufficient to start the convolution (for example, the coefficient data satisfies the required reuse count, the input feature data suffices for multiple computations, and the computation time is at least the data-loading time needed for the next step), the convolution enable is asserted. During the convolution, the input feature cache and the coefficient cache are read continuously, so the remaining space of the two caches gradually grows; when it exceeds the size of the next group of data, the write-cache enable is asserted. Consequently, if the loading time of the next group of data is less than the convolution time of the previous group, the convolution runs without interruption; if computation is fast and data loading slow, the convolution proceeds with gaps.
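The space-tracking rule can be sketched as follows, with an arbitrary capacity and group size:

```c
#include <stdio.h>

#define CAPACITY 1024   /* illustrative buffer size in bytes */

/* Free space grows as the convolution reads data out and shrinks as
   new data is written in; writes are enabled only when the free space
   can hold the whole next group, as described above. */
typedef struct { int used; } fbuf;

static void on_read (fbuf *b, int n) { b->used -= n; }
static void on_write(fbuf *b, int n) { b->used += n; }
static int  write_enabled(const fbuf *b, int next_group) {
    return (CAPACITY - b->used) > next_group;
}

int main(void) {
    fbuf feat = { .used = 900 };
    printf("enable=%d\n", write_enabled(&feat, 256));  /* 124 free: hold off  */
    on_read(&feat, 512);                               /* compute drains data */
    printf("enable=%d\n", write_enabled(&feat, 256));  /* 636 free: write     */
    return 0;
}
```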
The convolution operation is performed according to the corresponding feature data and coefficient data, and activation, normalization, and pooling are applied to the result. In a convolution operation, the data must undergo activation (for example, ReLU), normalization, and pooling after the multiply-add computation. If a slower vector operation unit outside the AIPU were used, large numbers of intermediate multiply-add results would pile up ahead of the vector operations, waiting for activation or pooling, and the efficiency of the whole convolution would be dragged down by the vector unit. Therefore, the vector operations needed by convolution, such as activation, are specialized and placed after the multiply-add matrix unit. These dedicated vector computation units for activation and similar operations can be chained with the multiply-add unit and the accumulator or work independently, and the intermediate result cache unit is shared among them.
The key points of the embodiments of the present application are:
(1) Based on the RISC-V instruction set, a processor architecture with three instruction branches is designed: the general-purpose instruction branch, the vector instruction branch, and the AIPU branch;
(2) The AIPU architecture is designed. The AIPU is combined with the RISC-V architecture in the form of an accelerator, has a dedicated register file configured through RISC-V load/store instructions, and accelerates convolution and matrix operations;
(3) The architecture of the AIPU multiply-add operation array is designed as a two-dimensional parallel multiply-add unit. Two vector cache buffers, together with the preceding input feature cache buffer and coefficient cache buffer, form two levels of double buffering. The front-level double buffer (whose purpose is to keep the downstream units supplied with a continuous stream of data) is formed jointly by the input feature cache, the coefficient cache, and the convolution control unit: by monitoring the remaining space in real time, the data needed for the next step is written into the buffer while data is continuously read out of it. The back-level buffer serves to increase bandwidth and enable data reuse;
(4) The cache buffers in the AIPU are designed to reasonably absorb the differences in data throughput between the functional modules at each stage of the convolution operation;
(5) A flexible address generator is designed: according to the register configuration and in concert with the downstream buffer, it completes the transformation of data dimensions while reading the data;
(6) Ping-pong operation registers are designed to guarantee that two different consecutive convolution operations run without interruption.
The architecture in the embodiments of the present application is very flexible in application: it offers both the control functions of a general-purpose CPU and the computing power required by AI. It can be applied to edge devices in the artificial intelligence Internet of Things, or scaled to greater computing power through a network-on-chip (NoC) and installed in a PC or server in the form of an accelerator card for cloud inference or training.
It should be particularly noted that the steps in the above embodiments of the method for data processing based on the RISC-V instruction set can be interleaved, replaced, added, or deleted with respect to one another. Such reasonable permutations and transformations of the method therefore also fall within the scope of protection of the present application, and the scope of protection of the present application should not be limited to the embodiments.
Based on the above object, a second aspect of the embodiments of the present application proposes a system for data processing based on the RISC-V instruction set, comprising: an acquisition module, configured to fetch an instruction from the RISC-V instruction space into a cache and determine the type of the instruction; a jump module, configured to regenerate the instruction address in response to the instruction being a branch jump instruction, and to jump to the corresponding branch according to the instruction address; an AIPU module, configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation in the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation in the second-level input feature cache and the second-level coefficient cache; and an execution module, configured to perform the convolution operation according to the corresponding feature data and coefficient data, and to apply activation, normalization, and pooling to the result.
In some embodiments, the system further comprises a vector module configured to: in response to jumping to the vector architecture branch, perform a vector operation according to the instruction.
In some embodiments, the system further comprises a first judgment module configured to: in response to the instruction being a load or store instruction, read the address in the storage space into the destination operand according to the address in the source operand.
In some embodiments, the system further comprises a second judgment module configured to: determine whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enable two ports of the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the system further comprises a configuration module configured to: configure the register file in the AIPU branch into two parts, the first part running the current AIPU operation while the second part obtains the parameters required for the AIPU's next operation.
In some embodiments, the system further comprises a conversion module configured to: according to the requirements of an operation, read the data corresponding to the operation, perform a dimension transformation on that data, and write the transformed data into the corresponding coefficient cache or input feature cache.
In some embodiments, the system further comprises a calculation module configured to: in response to performing a convolution calculation, read the data in the first-level input feature cache and the first-level coefficient cache, and determine whether the remaining space in the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space being larger than the size of the next group of data, enable the write cache.
Based on the above object, a third aspect of the embodiments of the present application proposes a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to implement the following steps: S1, fetching an instruction from the RISC-V instruction space into a cache, and determining the type of the instruction; S2, in response to the instruction being a branch jump instruction, regenerating the instruction address, and jumping to the corresponding branch according to the instruction address; S3, in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation in the first-level input feature cache and the first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation in the second-level input feature cache and the second-level coefficient cache; and S4, performing the convolution operation according to the corresponding feature data and coefficient data, and applying activation, normalization, and pooling to the result.
In some embodiments, the steps further comprise: in response to jumping to the vector architecture branch, performing a vector operation according to the instruction.
In some embodiments, the steps further comprise: in response to the instruction being a load or store instruction, reading the address in the storage space into the destination operand according to the address in the source operand.
In some embodiments, the steps further comprise: determining whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports of the same bit width as the vector registers to read and write simultaneously.
In some embodiments, the steps further comprise: configuring the register file in the AIPU branch into two parts, the first part running the current AIPU operation while the second part obtains the parameters required for the AIPU's next operation.
In some embodiments, the steps further comprise: according to the requirements of an operation, reading the data corresponding to the operation, performing a dimension transformation on that data, and writing the transformed data into the corresponding coefficient cache or input feature cache.
In some embodiments, the steps further comprise: in response to performing a convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and determining whether the remaining space in the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space being larger than the size of the next group of data, enabling the write cache.
FIG. 5 is a schematic diagram of the hardware structure of an embodiment of the above computer device for data processing based on the RISC-V instruction set provided by the present application.
Taking the device shown in FIG. 5 as an example, the device comprises a processor 201 and a memory 202, and may further comprise an input device 203 and an output device 204.
The processor 201, the memory 202, the input device 203, and the output device 204 may be connected by a bus or in other ways; connection by a bus is taken as the example in FIG. 5.
As a non-volatile computer-readable storage medium, the memory 202 can store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for data processing based on the RISC-V instruction set in the embodiments of the present application. The processor 201 runs the non-volatile software programs, instructions, and modules stored in the memory 202 to execute the various functional applications and data processing of the server, that is, to implement the method for data processing based on the RISC-V instruction set of the above method embodiments.
The memory 202 may include a program storage area and a data storage area, where the program storage area stores the operating system and the applications required by at least one function, and the data storage area stores data created by the use of the method for data processing based on the RISC-V instruction set, and the like. In addition, the memory 202 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 202 optionally includes memory located remotely from the processor 201, and these remote memories may be connected to the local module over a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 203 can receive entered information such as a user name and password. The output device 204 may include a display device such as a display screen.
One or more program instructions/modules corresponding to the method for data processing based on the RISC-V instruction set are stored in the memory 202 and, when executed by the processor 201, perform the method for data processing based on the RISC-V instruction set of any of the above method embodiments.
Any embodiment of a computer device that executes the above method for data processing based on the RISC-V instruction set can achieve effects the same as or similar to those of any corresponding method embodiment described above.
The present application also provides a computer-readable storage medium storing a computer program that performs the above method when executed by a processor.
FIG. 6 is a schematic diagram of an embodiment of the above computer storage medium for data processing based on the RISC-V instruction set provided by the present application. Taking the computer storage medium shown in FIG. 6 as an example, the computer-readable storage medium 3 stores a computer program 31 that performs the above method when executed by a processor.
Finally, it should be noted that those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program. The program of the method for data processing based on the RISC-V instruction set can be stored in a computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the above methods. The storage medium of the program may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the above computer program can achieve effects the same as or similar to those of any corresponding method embodiment described above.
The above are exemplary embodiments disclosed in the present application, but it should be noted that various changes and modifications may be made without departing from the scope of the disclosure of the embodiments of the present application as defined by the claims. The functions, steps, and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although the elements disclosed in the embodiments of the present application may be described or claimed in the singular, they may also be construed as plural unless expressly limited to the singular.
It should be understood that, as used herein, the singular form "a"/"an" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein includes any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed above are for description only and do not indicate the relative merits of the embodiments.
Those of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments can be accomplished by hardware, or by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the present application is limited to these examples. Under the idea of the embodiments of the present application, the technical features in the above embodiments or in different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the present application exist as above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, and the like made within the spirit and principles of the embodiments of the present application shall be included within the scope of protection of the embodiments of the present application.

Claims (10)

  1. A method for data processing based on a RISC-V instruction set, characterized in that it comprises the following steps:
    fetching an instruction from the RISC-V instruction space into a cache, and determining the type of the instruction;
    in response to the instruction being a branch jump instruction, regenerating an instruction address, and jumping to the corresponding branch according to the instruction address;
    in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation through a first-level input feature cache and a first-level coefficient cache, and storing the feature data and coefficient data for the next convolution operation through a second-level input feature cache and a second-level coefficient cache; and
    performing a convolution operation according to the corresponding feature data and coefficient data, and performing activation, normalization and pooling on the result of the operation.
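
For illustration, a minimal C sketch of the two-level (ping-pong) buffering recited in claim 1 follows. The names and sizes here (instr_type_t, pingpong_cache_t, the 4096-element buffers) are hypothetical; the claim fixes only the roles of the two cache levels, not any encoding or capacity.

    #include <stdint.h>

    /* Hypothetical instruction classes; the claims name branch jump,
       load/store, AIPU and vector branches but prescribe no encoding. */
    typedef enum { INSTR_BRANCH, INSTR_LOAD_STORE, INSTR_AIPU, INSTR_VECTOR } instr_type_t;

    /* Two cache levels: level `active` feeds the current convolution while
       the other level is being filled for the next convolution operation. */
    typedef struct {
        float feature[2][4096];   /* first-/second-level input feature caches */
        float coeff[2][4096];     /* first-/second-level coefficient caches   */
        int   active;             /* level serving the current operation      */
    } pingpong_cache_t;

    /* After the current operation finishes, the prefetched level takes over. */
    static void pingpong_swap(pingpong_cache_t *c) {
        c->active ^= 1;
    }

    static void run_convolution(pingpong_cache_t *c) {
        const float *feat = c->feature[c->active];
        const float *coef = c->coeff[c->active];
        /* ... convolve feat with coef, then activate, normalize and pool ... */
        (void)feat; (void)coef;
    }

    /* Dispatch skeleton for the instruction types distinguished in claim 1. */
    static void dispatch(instr_type_t t, pingpong_cache_t *c) {
        switch (t) {
        case INSTR_AIPU:       run_convolution(c); pingpong_swap(c); break;
        case INSTR_BRANCH:     /* regenerate instruction address and jump */ break;
        case INSTR_LOAD_STORE: /* see the sketch after claim 3 */ break;
        case INSTR_VECTOR:     /* vector architecture branch */ break;
        }
    }

While run_convolution consumes the active level, a writer can fill the other level, which is what lets the data supply overlap the computation.
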
  2. The method according to claim 1, characterized in that it further comprises:
    in response to jumping to the vector architecture branch, performing vector operations according to the instruction.
  3. The method according to claim 1, characterized in that it further comprises:
    in response to the instruction being a load or store instruction, reading the storage space addressed by the source operand into the destination operand.
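
A minimal sketch of the load/store path described in claim 3, with a hypothetical word-addressed array standing in for the storage space; the claim requires only that the source operand carries the address and the destination operand receives the data.

    #include <stdint.h>

    /* Hypothetical word-addressed storage. */
    static uint32_t load_word(const uint32_t *storage, uint32_t src_operand) {
        return storage[src_operand];   /* value lands in the destination operand */
    }

    static void store_word(uint32_t *storage, uint32_t src_operand, uint32_t value) {
        storage[src_operand] = value;
    }
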
  4. The method according to claim 1, characterized in that it further comprises:
    determining whether the vector registers corresponding to the vector source operands are in the same group; and
    in response to the vector registers corresponding to the vector source operands not being in the same group, performing simultaneous reads and writes through two ports having the same bit width as the vector registers.
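
A sketch of the group check in claim 4. The split of 32 vector registers into groups of eight is an assumption; the claim does not fix the group size, only the behavior when the two source operands fall in different groups.

    /* Hypothetical grouping: 32 vector registers, 8 per group. */
    #define VREGS_PER_GROUP 8

    static int same_group(int vs1, int vs2) {
        return (vs1 / VREGS_PER_GROUP) == (vs2 / VREGS_PER_GROUP);
    }

    /* Operands in different groups can be fetched in one cycle over the two
       full-width ports; operands in the same group must share one port. */
    static void fetch_operands(int vs1, int vs2) {
        if (!same_group(vs1, vs2)) {
            /* port 0 reads vs1 while port 1 reads vs2, in parallel */
        } else {
            /* serialize: one port, two accesses */
        }
    }
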
  5. The method according to claim 1, characterized in that it further comprises:
    configuring the register file in the AIPU branch into two parts, the first part running the current AIPU operation and the second part acquiring the parameters required for the next AIPU operation.
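
A sketch of the split register file of claim 5, modeled as two parameter banks. The fields of aipu_params_t are hypothetical; the claim says only that one half configures the running operation while the other half collects the next operation's parameters.

    /* Hypothetical AIPU parameter block. */
    typedef struct {
        int in_h, in_w, in_c;   /* input feature map dimensions    */
        int k_h, k_w, out_c;    /* kernel size and output channels */
        int stride, pad;
    } aipu_params_t;

    /* bank[active] drives the running operation; the idle bank is preloaded. */
    typedef struct {
        aipu_params_t bank[2];
        int active;
    } aipu_regfile_t;

    static void preload_next(aipu_regfile_t *rf, const aipu_params_t *next) {
        rf->bank[rf->active ^ 1] = *next;   /* fill the idle half */
    }

    static void start_next(aipu_regfile_t *rf) {
        rf->active ^= 1;                    /* idle half becomes active */
    }
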
  6. The method according to claim 5, characterized in that it further comprises:
    according to the requirements of an operation, reading the data corresponding to the operation, performing dimension conversion on the data, and writing the converted data into the corresponding coefficient cache or input feature cache.
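
One plausible form of the dimension conversion in claim 6 is a channel-major to pixel-major reshuffle; the claim itself does not fix the layouts, so both orderings here are assumptions.

    /* Reorder an h x w x c block from channel-major (all of channel 0, then
       channel 1, ...) to pixel-major (all channels of one pixel together). */
    static void dim_convert(const float *src, float *dst, int h, int w, int c) {
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                for (int ch = 0; ch < c; ch++)
                    dst[(y * w + x) * c + ch] = src[ch * h * w + y * w + x];
    }
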
  7. The method according to claim 1, characterized in that it further comprises:
    in response to performing a convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and determining whether the remaining space in the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data;
    in response to the remaining space in the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling the write cache.
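
A sketch of the write-cache gating of claim 7. Bookkeeping in bytes is an assumption; the claim requires only the comparison between remaining space and the size of the next data group.

    #include <stddef.h>

    /* Hypothetical first-level buffer bookkeeping. */
    typedef struct {
        size_t capacity;        /* total bytes                         */
        size_t used;            /* bytes still holding unconsumed data */
        int    write_enabled;
    } l1_buffer_t;

    /* Writing is enabled only while the free space exceeds the next data
       group, so prefetch never overwrites data the convolution still needs. */
    static void maybe_enable_write(l1_buffer_t *b, size_t next_group_bytes) {
        b->write_enabled = (b->capacity - b->used) > next_group_bytes;
    }
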
  8. A system for data processing based on a RISC-V instruction set, characterized in that it comprises:
    an acquisition module, configured to fetch an instruction from the RISC-V instruction space into a cache, and determine the type of the instruction;
    a jump module, configured to regenerate an instruction address in response to the instruction being a branch jump instruction, and jump to the corresponding branch according to the instruction address;
    an AIPU module, configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation through a first-level input feature cache and a first-level coefficient cache, and store the feature data and coefficient data for the next convolution operation through a second-level input feature cache and a second-level coefficient cache; and
    an execution module, configured to perform a convolution operation according to the corresponding feature data and coefficient data, and perform activation, normalization and pooling on the result of the operation.
  9. A computer device, characterized in that it comprises:
    at least one processor; and
    a memory storing computer instructions executable on the processor, wherein the instructions, when executed by the processor, implement the steps of the method according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
PCT/CN2022/074414 2021-02-09 2022-01-27 Data processing method and system based on risc-v instruction set, and device and medium WO2022170997A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110175746.6 2021-02-09
CN202110175746.6A CN112860320A (en) 2021-02-09 2021-02-09 Method, system, device and medium for data processing based on RISC-V instruction set

Publications (1)

Publication Number Publication Date
WO2022170997A1 (en)

Family

ID=75989351

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074414 WO2022170997A1 (en) 2021-02-09 2022-01-27 Data processing method and system based on risc-v instruction set, and device and medium

Country Status (2)

Country Link
CN (1) CN112860320A (en)
WO (1) WO2022170997A1 (en)


Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set
CN113254391B (en) * 2021-06-25 2021-11-02 之江实验室 Neural network accelerator convolution calculation and data loading parallel method and device
CN113642722A (en) * 2021-07-15 2021-11-12 深圳供电局有限公司 Chip for convolution calculation, control method thereof and electronic device
CN114399034B (en) * 2021-12-30 2023-05-02 北京奕斯伟计算技术股份有限公司 Data handling method for direct memory access device
CN115113933B (en) * 2022-08-25 2022-11-15 旋智电子科技(上海)有限公司 Apparatus for accelerating data operation
CN115248701B (en) * 2022-09-21 2022-12-20 进迭时空(杭州)科技有限公司 Zero-copy data transmission device and method between processor register files
CN115576606B (en) * 2022-11-16 2023-03-21 苏州浪潮智能科技有限公司 Method for realizing matrix transposition multiplication, coprocessor, server and storage medium
CN116149554B (en) * 2023-02-08 2023-11-24 珠海妙存科技有限公司 RISC-V and extended instruction based data storage processing system and method thereof
CN116804915B (en) * 2023-08-28 2023-12-15 腾讯科技(深圳)有限公司 Data interaction method, processor, device and medium based on memory


Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1516001A (en) * 2003-01-08 2004-07-28 上海海尔集成电路有限公司 New-type RISC pipeline microcontroller structure and its operation method
CN100555225C (en) * 2008-03-17 2009-10-28 中国科学院计算技术研究所 RISC processor device and method supporting an X86 virtual machine
CN106940815B (en) * 2017-02-13 2020-07-28 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN108647773B (en) * 2018-04-20 2021-07-23 复旦大学 Hardware interconnection system capable of reconstructing convolutional neural network
CN110659069B (en) * 2018-06-28 2022-08-19 赛灵思公司 Instruction scheduling method for performing neural network computation and corresponding computing system
CN111191774B (en) * 2018-11-14 2023-04-07 上海富瀚微电子股份有限公司 Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof
CN111078287B (en) * 2019-11-08 2022-07-19 苏州浪潮智能科技有限公司 Vector operation co-processing method and device
CN111160545A (en) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 Artificial neural network processing system and data processing method thereof
CN111582465B (en) * 2020-05-08 2023-04-07 中国科学院上海高等研究院 Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN112130901A (en) * 2020-09-11 2020-12-25 山东云海国创云计算装备产业创新中心有限公司 RISC-V based coprocessor, data processing method and storage medium
CN112232517B (en) * 2020-09-24 2022-05-31 苏州浪潮智能科技有限公司 Artificial intelligence accelerates engine and artificial intelligence treater

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740749A (en) * 2017-10-30 2019-05-10 北京深鉴智能科技有限公司 Hardware implementation apparatus and method for high-speed fully-connected computation
CN111656367A (en) * 2017-12-04 2020-09-11 优创半导体科技有限公司 System and architecture for neural network accelerator
CN110007961A (en) * 2019-02-01 2019-07-12 中山大学 Edge computing hardware architecture based on RISC-V
US20210011653A1 (en) * 2019-07-08 2021-01-14 Canon Kabushiki Kaisha Operation processing apparatus, operation processing method, and non-transitory computer-readable storage medium
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147A (en) * 2022-11-30 2023-03-14 珠海笛思科技有限公司 Data communication processing method and system
CN115801147B (en) * 2022-11-30 2023-09-22 珠海笛思科技有限公司 Data communication processing method and system

Also Published As

Publication number Publication date
CN112860320A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
WO2022170997A1 (en) Data processing method and system based on risc-v instruction set, and device and medium
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
JP6977239B2 (en) Matrix multiplier
CN109740747B (en) Operation method, device and Related product
KR102606825B1 (en) Neural network system reshaping neural network model, Application processor having the same and Operating method of neural network system
KR20210082058A (en) Configurable processor element arrays for implementing convolutional neural networks
CN112612521A (en) Apparatus and method for performing matrix multiplication operation
KR102610842B1 (en) Processing element and operating method thereof in neural network
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
WO2022134729A1 (en) Risc-v-based artificial intelligence inference method and system
CN112799726A (en) Data processing device, method and related product
CN111860773B (en) Processing apparatus and method for information processing
Wen et al. Taso: Time and space optimization for memory-constrained DNN inference
Wang et al. SOLAR: Services-oriented deep learning architectures-deep learning as a service
CN112051981B (en) Data pipeline calculation path structure and single-thread data pipeline system
Haghi et al. Flash: FPGA-accelerated smart switches with GCN case study
US20190272460A1 (en) Configurable neural network processor for machine learning workloads
Song et al. Gpnpu: Enabling efficient hardware-based direct convolution with multi-precision support in gpu tensor cores
Gottlieb et al. Clustered programmable-reconfigurable processors
de Dinechin et al. Deep learning inference on the mppa3 manycore processor
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
CN109189475A (en) The construction method of programmable artificial intelligence accelerator instruction set
Gupta et al. Accelerating CNN inference on long vector architectures via co-design
CN114327639A (en) Accelerator based on data flow architecture, and data access method and equipment of accelerator
CN102446086A (en) Parameterized specific instruction set processor design platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22752158

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22752158

Country of ref document: EP

Kind code of ref document: A1