WO2022170997A1 - Method and system for data processing based on a RISC-V instruction set, and device and medium


Info

Publication number
WO2022170997A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
instruction
cache
coefficient
vector
Prior art date
Application number
PCT/CN2022/074414
Other languages
English (en)
Chinese (zh)
Inventor
贾兆荣
Original Assignee
山东英信计算机技术有限公司
Priority date
Filing date
Publication date
Application filed by 山东英信计算机技术有限公司
Publication of WO2022170997A1

Classifications

    • G06F9/30047 Prefetch instructions; cache control instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3005 Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30105 Register structure
    • G06F9/3804 Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F9/3814 Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the field of data processing, and more particularly to a method, system, computer device and readable medium for data processing based on the RISC-V instruction set.
  • The value of data lies in analysis and utilization, not simple storage.
  • The amount of data is constantly growing, it is impossible to transmit all of it to the cloud over the network, and bandwidth grows more slowly than the data does.
  • In fields such as autonomous driving, data must be judged at the edge.
  • Scenarios with high privacy-protection requirements, such as medical information, or data that users are unwilling to share in the cloud, require local storage.
  • Most of the data generated by security equipment is useless or has no potential to be tapped; transmitting all of it to the cloud wastes bandwidth.
  • When intelligent analysis is performed at the edge, only useful or potentially useful data is transmitted to the cloud, greatly saving network bandwidth. The transfer of data processing from the cloud to the edge is therefore an inevitable trend, and edge-end AI (artificial intelligence) chips are the general trend as well.
  • Artificial intelligence processing at the edge requires AI chips, and the challenges faced by AI chips are mainly computing power and computing efficiency.
  • The computing power of an AI chip is determined by the number of on-chip computing units. Since the amount of data involved in AI computing is very large, in theory the larger the computing power the better; in practice, however, computing power is restricted by several factors:
  • On-chip storage bandwidth and bus bandwidth. The main contradiction of AI chips is between storage bandwidth and computing power: the greater the computing power, the greater the amount of input data, intermediate results and output data, and the higher the required storage bandwidth. Current storage bandwidth falls far short of computing-power requirements, and if computing units and storage units are not arranged reasonably, the result is a chip with large nominal computing power but low efficiency.
  • Computing efficiency. A deep neural network model usually consists of multiple layers, the output of one layer being the input of the next; within a layer, the result of the multiply-add operation is often the input of activation, pooling and normalization. If multi-threading, parallel computing and computation pipelining cannot be implemented reasonably, the calculation of one step will block the calculation of the next, wasting resources and reducing computing efficiency.
  • As mentioned above, AI involves a variety of operators while the AI chip hardware is fixed. Making unchanging hardware handle variable operators efficiently requires software that can allocate hardware resources and compile efficient machine code based on the hardware architecture; the AI chip is also required to have efficient control capabilities.
  • The purpose of the embodiments of the present application is to propose a method, system, computer device and computer-readable storage medium for data processing based on the RISC-V instruction set. The AIPU (artificial intelligence processing unit) shares memory with the CPU, making computing and storage adjacent, improving memory-access bandwidth, facilitating data interaction between the AIPU and the CPU, reducing the amount of data interaction with external buses, and reducing the demand for bus bandwidth.
  • The AIPU and the CPU each have a small buffer (cache) used to cache input data, intermediate results, output data and CPU pre-read instructions, allowing data to be loaded while data is being computed and relaxing the time available for data reads and writes, further reducing the need for bus bandwidth.
  • An aspect of the embodiments of the present application provides a method for data processing based on a RISC-V instruction set, including the following steps: acquiring an instruction from the RISC-V instruction space and caching it in the cache, and judging the instruction type; in response to the instruction being a branch jump instruction, regenerating the instruction address and jumping to the corresponding branch according to the instruction address; in response to jumping to the AIPU branch, storing the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and storing the feature data and coefficient data of the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and performing a convolution operation according to the corresponding feature data and coefficient data, and activating, normalizing and pooling the result obtained by the operation.
  • The method further includes performing a vector operation according to the instruction in response to jumping to the vector architecture branch.
  • The method further includes: in response to the instruction being a load or store instruction, reading the data at the address in the storage space into the destination operand according to the address in the source operand.
  • The method further includes: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to be read and written at the same time.
  • The method further includes: configuring the register file in the AIPU branch into two parts, the first part running the current AIPU operation and the second part obtaining the parameters required by the AIPU for the next operation.
  • The method further includes: according to the requirements of the operation, reading the data corresponding to the operation, performing dimension transformation on it, and writing the converted data into the corresponding coefficient cache or input feature cache.
  • The method further includes: in response to performing the convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling the write cache.
  • Another aspect of the embodiments of the present application provides a system for data processing based on a RISC-V instruction set, including: an acquisition module configured to acquire an instruction from the RISC-V instruction space, cache it in the cache, and determine the instruction type; a jump module configured to regenerate the instruction address in response to the instruction being a branch jump instruction, and to jump to the corresponding branch according to the instruction address; an AIPU module configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data of the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and an execution module configured to perform a convolution operation according to the corresponding feature data and coefficient data, and to activate, normalize and pool the results obtained by the operation.
  • A computer device is provided, including: at least one processor; and a memory, where the memory stores computer instructions that can be executed on the processor, and the instructions, when executed by the processor, implement the steps of the above method.
  • a computer-readable storage medium stores a computer program that implements the above method steps when executed by a processor.
  • The AIPU (artificial intelligence processing unit) shares memory with the CPU, and the AIPU establishes the first-level input feature cache, the first-level coefficient cache, the second-level input feature cache and the second-level coefficient cache in the shared memory. This makes calculation and storage adjacent, improves memory-access bandwidth, facilitates data interaction between the AIPU and the CPU, reduces the amount of data interaction with external buses, and reduces the demand for bus bandwidth.
  • There is a small buffer inside each of the AIPU and the CPU to cache input data, intermediate results, output data and CPU pre-read instructions, allowing data to be loaded while data is being computed, relaxing data read and write timing, and further reducing the need for bus bandwidth.
  • FIG. 1 is a schematic diagram of an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application;
  • FIG. 2 is a schematic diagram of a CPU architecture in an embodiment of the present application;
  • FIG. 3 is a schematic diagram of an AIPU architecture provided by the present application;
  • FIG. 4 is a schematic diagram of a convolution operation in an embodiment of the method for data processing based on a RISC-V instruction set provided by the present application;
  • FIG. 5 is a schematic diagram of the hardware structure of an embodiment of a computer device for data processing based on a RISC-V instruction set provided by the present application;
  • FIG. 6 is a schematic diagram of an embodiment of a computer storage medium for data processing based on a RISC-V instruction set provided by the present application.
  • FIG. 1 shows a schematic diagram of an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application.
  • the embodiment of the present application includes the following steps:
  • A storage-computing integrated structure is adopted, and the AIPU and the CPU share memory; the AIPU establishes the first-level input feature cache, the first-level coefficient cache, the second-level input feature cache and the second-level coefficient cache in the shared memory.
  • This makes computing adjacent to storage, improves memory-access bandwidth, facilitates data interaction between the AIPU and the CPU, reduces the amount of data interaction with external buses, and reduces the demand for bus bandwidth.
  • There is a small buffer inside each of the AIPU and the CPU to cache input data, intermediate results, output data and CPU pre-read instructions, allowing data to be loaded while data is being computed, relaxing data read and write timing, and further reducing the need for bus bandwidth.
  • The RISC-V instruction set includes a general instruction set and a vector extension instruction set, which can be divided into: integer instruction set I, multiply-add operation instruction set M, atomic operation instruction set A, single-precision instruction set F, double-precision instruction set D, compressed instruction set C, and vector instruction set V.
  • The arithmetic logic operation unit completes the IMAFDC instruction set operations.
  • The vector operation unit completes the vector instruction set V operations.
  • The CPU architecture is designed according to the RISC-V instruction set; the function of the CPU is to run system code and complete system control and data operations.
  • FIG. 2 shows a schematic diagram of a CPU architecture in an embodiment of the present application.
  • The CPU adopts a two-stage pipeline architecture.
  • The first stage is the instruction fetch stage, which is responsible for fetching instructions from the instruction storage space into the instruction cache.
  • The second stage decodes and executes the instruction.
  • Decoding analyzes the type of the instruction (vector instruction or ordinary instruction) and starts the corresponding data operation according to the instruction type and opcode.
  • For example, a vector add instruction reads the data from the vector data storage into the vector registers, completes the operation in the vector operation unit, and caches the result in the vector data cache.
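  • As an illustration of this decode step, the following minimal C sketch (not part of the application) routes an instruction by its RV32 major opcode in bits [6:0]; mapping the custom-0 opcode to the AIPU path is an assumption made for illustration:

```c
#include <stdint.h>

enum branch { GENERAL, VECTOR, AIPU_BRANCH };

/* Dispatch on the RV32 major opcode (instruction bits [6:0]).
 * OP-V (0x57) is the standard vector-extension opcode; routing
 * custom-0 (0x0b) to the AIPU branch is an illustrative assumption. */
static enum branch dispatch(uint32_t insn)
{
    switch (insn & 0x7f) {
    case 0x57: return VECTOR;        /* OP-V: vector instruction       */
    case 0x0b: return AIPU_BRANCH;   /* custom-0: user-defined AIPU op */
    default:   return GENERAL;       /* IMAFDC scalar instructions     */
    }
}
```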
  • Vector data cache: in AI inference calculation, vector operations are usually not independent, and a computation often has to be completed by several vector operations in pipeline form. If intermediate results were stored in the data SRAM (static random access memory), each vector store or read might take multiple cycles, greatly increasing the vector calculation cycle count. With a vector cache buffer, data can be loaded into it before a vector calculation starts, and the final result can be stored back to the data SRAM after the calculation completes. Both the prefetching of vector data and the storing of results can be done during other operations, reducing vector operation cycles. The port of the vector data cache module is wide in order to meet the bandwidth requirements of the vector operation unit.
  • The instruction is obtained from the RISC-V instruction space and cached in the cache, and it is judged whether the instruction is a branch jump instruction. In response to the instruction being a branch jump instruction whose branch condition is established (or an unconditional jump), the pc (instruction address) is regenerated, and the jump to the corresponding branch is made according to the instruction address.
  • The architecture has three branches: the general architecture branch, which supports general-purpose instructions and implements the functions of a CPU; the vector architecture branch, which supports the RISC-V vector instruction set and completes vector operations; and the AIPU branch, which supports general load/store instructions as well as custom user instructions, and is used to complete special intensive calculations such as convolution and matrix multiplication.
  • The AIPU branch establishes the connection with the AIPU architecture.
  • The AIPU branch configures the registers of each functional module through the load/store instructions of the CPU.
  • The work of each functional module in the AIPU is controlled only by these registers and does not require the participation of CPU instructions; the calculation efficiency is therefore high but not very flexible, which suits special large-scale computing.
  • The vector architecture branch is controlled by the vector instructions of the CPU, and each step of the operation requires instruction control. The vector architecture branch is thus more flexible than the AIPU but computationally less efficient, and it suits small-batch, diversified vector calculations. Since vector operations involve a lot of data, how to speed up data load and store is the key.
  • The feature data and coefficient data for the current convolution operation are stored through the first-level input feature cache and the first-level coefficient cache, and the feature data and coefficient data for the next convolution operation are stored through the second-level input feature cache and the second-level coefficient cache.
  • the input feature vector buffer and coefficient vector buffer are mainly used to buffer the data to be calculated in the current clock cycle of the multiply-add operation unit, and these data are all calculated in parallel in the form of vectors.
  • FIG. 3 shows a schematic diagram of the AIPU architecture provided by this application.
  • the AIPU architecture includes register files, DMA, read and write interface arbitration, address generators, convolution timing controllers, vector caches, multiply-add operation matrices, intermediate result accumulators, and special vector operation units.
  • The core of the AIPU architecture is the multiply-add matrix module, which contains a large number of multiply-add hardware resources and can realize parallel, high-speed multiply-add operations to meet the computing-power requirements of intensive convolution/matrix operations; the other modules exist to make the convolution operation more efficient.
  • Data multiplexing is introduced to solve the problem that the data demand during calculation is large while the bandwidth of the data bus and SRAM is insufficient: the data that has been read is reused as much as possible to reduce the pressure on bandwidth.
  • Buffers (caches) are set to match the data throughput rates of the modules before and after each buffer, reducing blocking so that every functional module can run at full speed.
  • The vector operation unit can provide different algorithm support according to the requirements of the convolution algorithm, so that data can be read once, used to complete the operation, and then stored, instead of being read multiple times to complete one full convolution calculation.
  • The address generator cooperates with the read-write control and can read and write data in different orders, so the resulting data arrangement makes the convolution operation more efficient. The convolutional neural network used in AI computing is usually divided into many layers, and an AI inference chip computes layer by layer, each layer containing a large number of convolution or matrix operations. After the ping-pong register is established, the parameters required for the calculation of the next layer, such as data dimensions, can be configured while the current layer is being calculated; in this way, the calculation of the next layer can start immediately after this layer ends, reducing the computing time of the entire neural network and improving computing efficiency.
  • FIG. 4 shows a schematic diagram of a convolution operation in an embodiment of a method for data processing based on a RISC-V instruction set provided by the present application.
  • The f0 vector block is multiplied-and-added simultaneously with w0...w7 (a vector block contains multiple vector elements; the multiply-add operation multiplies corresponding vector elements and then accumulates all of the products, the accumulated sum being the output result).
  • When f0 and w0...w7 are sent to the multiply-add matrix, f0 is effectively copied 8 times and undergoes multiply-add operations with each of the w0...w7 vector blocks.
  • Likewise, f1...f7 all need to be multiplied-and-added with w0...w7.
  • f0...f7 reuse the w0...w7 vector blocks, and each w vector block reuses the same f vector blocks. Therefore, across these 8 matrix operations it is only necessary to fetch w0...w7 once and then read one f vector block per calculation. The 8 operations require 8 clock cycles, and reading the next w0...w7 also requires 8 cycles, so the reading of the w vector blocks can be hidden in the calculation process (that is, the data-reading process completely overlaps the calculation process, with no need to interrupt the calculation and wait for data). This is why the input feature vector buffer and coefficient vector buffer need to be set up.
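  • A minimal C sketch of this reuse pattern (block size and data types are assumptions, not values given in the application): w0...w7 stay resident in the coefficient vector buffer for the whole round, one f block is consumed per cycle, and the next w group can be prefetched during the 8 compute cycles.

```c
#include <stdint.h>

#define NBLK 8    /* 8 f blocks and 8 w blocks per round, as in FIG. 4 */
#define VLEN 16   /* elements per vector block (illustrative)          */

/* One round of the multiply-add matrix: each of the 8 cycles reads one
 * f block and multiply-accumulates it against all 8 resident w blocks,
 * so the w group is fetched only once per round. */
void mac_round(const int16_t f[NBLK][VLEN],
               const int16_t w[NBLK][VLEN],
               int32_t acc[NBLK][NBLK])
{
    for (int i = 0; i < NBLK; i++)        /* one f block per cycle     */
        for (int j = 0; j < NBLK; j++)    /* 8 MAC columns in parallel */
            for (int e = 0; e < VLEN; e++)
                acc[i][j] += f[i][e] * w[j][e];
}
```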
  • the intermediate result cache is used to cache the intermediate results of vector calculations. According to the convolution principle, a vector multiplication and addition operation cannot obtain the final result, and the results of multiple multiplications and additions need to be accumulated. Therefore, a cache is set after the multiply-accumulate result accumulator. When the intermediate results are continuously accumulated to obtain the complete final result, the complete result needs to be stored in the complete result cache buffer.
  • This cache buffer has multiple functions:
  • the cache buffer is shared by the subsequent activation modules, pooling modules, etc., and is used to store the input data and output data of these computing modules;
  • the module has a bus read and write control to send the final calculation data to the DMA interface.
  • the method further includes: configuring the register file in the AIPU branch into two parts, the first part runs the current AIPU operation, and the second part obtains the parameters required by the AIPU for the next operation.
  • The register file can be declared as system registers when the compiler backend is extended, and the configuration information is loaded by load instructions.
  • The register file is configured into two parts that perform a ping-pong operation: while the first part controls the current AIPU operation, the second part receives the parameters required for the next AIPU calculation, and when the first part's operation completes, the second part becomes the currently active register set. This keeps the AIPU working continuously without interruption.
  • The principle of register file configuration and switching is as follows. Since two sets of registers are added when the chip architecture is described in the compiler backend, the compiler finds the corresponding registers according to the register descriptions in the architecture: for example, load r0, addr loads the data at addr into register set 0, and load r1, addr loads it into register set 1. When the AIPU uses the registers, however, it must determine which set is currently available, so a "calculation complete" signal alternately enables register set 0 and register set 1. During programming, after one AIPU calculation is enabled, the other register set should be configured immediately to prepare for the next AIPU start.
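  • The alternation can be pictured with the following hypothetical C sketch (the register fields and names are illustrative, not the application's actual register map): the active bank drives the current run while software fills the other, and the "calculation complete" signal swaps them.

```c
#include <stdint.h>

/* Illustrative ping-pong register file: bank[active] controls the
 * running AIPU operation; bank[1 - active] is being configured for
 * the next layer.  Field names are assumptions. */
struct aipu_regs { uint32_t dims, stride, src_addr, dst_addr; };

static struct aipu_regs bank[2];
static int active;                  /* bank currently driving the AIPU */

void on_calc_complete(void)         /* hardware "calculation complete" */
{
    active ^= 1;                    /* pre-configured bank takes over  */
    /* software now refills bank[active ^ 1] for the following layer   */
}
```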
  • The method further includes: in response to the instruction being a load or store instruction, reading the data at the address in the storage space into the destination operand according to the address in the source operand.
  • RISC-V instructions usually have one or two source operands rs1 and rs2, and the corresponding vector source operands are vs1 and vs2.
  • the instruction sends the source operand to the corresponding execution unit (including data load, data storage, scalar calculation, vector calculation, etc.) according to the opcode (representing the type of calculation, such as addition, subtraction, multiplication, division, etc.).
  • When the opcode represents load/store, the instruction is a memory-access instruction, and the execution unit reads the data at the corresponding address of the data storage space into the destination operand (rd or vd) according to the address in rs1.
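  • As a toy model of this load path (the memory size and helper names are assumptions), the effective address comes from rs1 and the fetched word lands in rd:

```c
#include <stdint.h>
#include <string.h>

static uint32_t x[32];              /* scalar register file              */
static uint8_t  sram[1 << 16];      /* data storage space (illustrative) */

/* Execute "lw rd, imm(rs1)": read the data at the address held in rs1
 * (plus offset) into the destination operand rd. */
void exec_load_word(int rd, int rs1, int32_t imm)
{
    uint32_t addr = x[rs1] + imm;
    memcpy(&x[rd], &sram[addr], sizeof(uint32_t));
}
```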
  • The method further includes: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to be read and written at the same time.
  • The software divides the 32 vector registers into 8 groups, and each software vector group corresponds to one hardware vector group. If the vector registers vs1 and vs2 are in the same group during calculation, only one port is enabled for reading and writing; if they are in two groups, two ports are enabled to read and write at the same time.
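  • The port-enable rule can be sketched as follows (the 4-registers-per-group layout follows from dividing 32 registers into 8 groups; the code is illustrative):

```c
/* 32 vector registers divided into 8 software groups of 4. */
static inline int vgroup(int vreg) { return vreg / 4; }

/* One read/write port suffices when vs1 and vs2 share a group;
 * two same-width ports are enabled when they do not. */
int ports_to_enable(int vs1, int vs2)
{
    return (vgroup(vs1) == vgroup(vs2)) ? 1 : 2;
}
```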
  • the method further includes performing a vector operation according to the instruction in response to the jump to the vector architecture branch.
  • the method further includes: according to the requirements of the operation, reading data corresponding to the operation and performing dimension transformation on the data corresponding to the operation, and writing the converted data into the corresponding coefficient buffer or input feature cache.
  • both the coefficient cache unit and the input feature cache unit need to read the weight and feature data from the shared external SRAM, and the address generator generates the corresponding SRAM address according to the register configuration.
  • Convolution calculations and matrix operations use different calculation methods for different applications; for example, convolution is divided into one-dimensional/two-dimensional/three-dimensional convolution, dilated convolution, depthwise convolution, separable convolution, transposed convolution, and so on. Different calculation methods read data in different ways.
  • Convolution calculation usually also transforms the dimensions of the data accordingly, which requires the address generator to read data in different ways according to the register configuration, completing these conversions in passing. The functions of the address generator and the read-write data control are therefore: according to different calculation requirements, complete the reading of the data, make the corresponding dimension conversion, and write the result into the corresponding coefficient (weight) cache unit or input feature cache unit.
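  • A hypothetical sketch of such an address generator for a strided/dilated 2D convolution read (the register fields and the row-major layout are assumptions): by choosing the walk order from the register configuration, the data is re-dimensioned as a side effect of reading it.

```c
#include <stdint.h>

struct agen_cfg {
    uint32_t base;      /* feature-map base address in SRAM   */
    uint32_t width;     /* row width of the feature map       */
    uint32_t stride;    /* convolution stride                 */
    uint32_t dilation;  /* kernel dilation (1 = dense kernel) */
};

/* Address of the input element needed for output position (oy, ox)
 * and kernel tap (ky, kx), assuming a row-major feature map. */
uint32_t agen_addr(const struct agen_cfg *c,
                   uint32_t oy, uint32_t ox,
                   uint32_t ky, uint32_t kx)
{
    uint32_t iy = oy * c->stride + ky * c->dilation;
    uint32_t ix = ox * c->stride + kx * c->dilation;
    return c->base + iy * c->width + ix;
}
```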
  • the convolution timing control unit is the control core of the entire AIPU, which is responsible for collecting the status of each functional module, controlling the enabling of related modules, and generating the synchronization signal of the convolution operation.
  • The convolution sync signal is the beat of the entire convolution process, generated once every M cycles.
  • The size of M is determined by the number of times data is multiplexed in the convolution process.
  • The data-loading synchronization signal is the convolution synchronization signal delayed by a fixed read/write data cycle count.
  • The accumulator synchronization signal is the convolution synchronization signal delayed by a fixed multiply-add operation cycle count.
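  • The timing relationship can be modeled with a small sketch (M and the two latencies are illustrative placeholders, not values from the application):

```c
/* Convolution sync fires every M cycles; the load sync and the
 * accumulator sync are the same beat shifted by fixed latencies. */
#define M         8   /* data-multiplexing count (illustrative)   */
#define RD_DELAY  2   /* fixed read/write data latency, in cycles */
#define MAC_DELAY 4   /* fixed multiply-add latency, in cycles    */

int conv_sync(unsigned cycle) { return cycle % M == 0; }
int load_sync(unsigned cycle) { return cycle >= RD_DELAY  && conv_sync(cycle - RD_DELAY); }
int acc_sync(unsigned cycle)  { return cycle >= MAC_DELAY && conv_sync(cycle - MAC_DELAY); }
```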
  • The method further includes: in response to performing the convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling the write cache.
  • the remaining available storage space of the input feature cache and coefficient cache is determined by the two processes of reading and writing data. Data writing reduces the remaining available space, and data reading increases the remaining available space.
  • the convolution timing controller calculates the available space size of the cache according to the number of times of reading and writing the cache.
  • When the data in the two caches is enough to start the convolution operation (for example, the coefficient data meets the multiplexing-count requirement, the input feature data meets the requirement of multiple computations, and the calculation time is greater than or equal to the data-loading time required for the next calculation), the convolution is enabled.
  • As the convolution proceeds, the input feature cache and coefficient cache are continuously read, so the remaining space of the two caches gradually increases.
  • When the remaining space of both caches is larger than the size of the next group of data, the write cache is enabled. Therefore, if the load time of the next set of data is less than the convolution calculation time of the previous set, the convolution calculation runs uninterrupted; if calculation is fast and data loading is slow, the convolution calculation process will have interruptions.
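  • A minimal sketch of this flow control (the sizes and counters are illustrative): the controller tracks remaining space from read/write counts, as described above, and enables writing only while both caches can absorb the next group.

```c
#include <stdbool.h>
#include <stdint.h>

struct buf_state {
    uint32_t capacity;       /* cache size in bytes                */
    uint32_t written, read;  /* byte counts kept by the controller */
};

static uint32_t remaining(const struct buf_state *b)
{
    return b->capacity - (b->written - b->read);
}

/* Write caching is enabled only while both first-level caches have
 * more remaining space than the next group of data needs. */
bool write_enable(const struct buf_state *feat,
                  const struct buf_state *coef,
                  uint32_t next_group_bytes)
{
    return remaining(feat) > next_group_bytes &&
           remaining(coef) > next_group_bytes;
}
```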
  • the convolution operation is performed according to the corresponding feature data and coefficient data, and the results obtained by the operation are activated, normalized and pooled.
  • The data needs to be activated (for example with ReLU), normalized and pooled after the multiply-add calculation is completed.
  • If a computationally slow vector operation unit outside the AIPU were used, a large number of intermediate multiply-add results would accumulate ahead of the vector operation, waiting for activation or pooling, and the efficiency of the entire convolution operation would be dragged down by that unit. Therefore, the vector operations required for convolution, such as activation, are implemented as dedicated units placed directly after the multiply-add matrix unit.
  • The dedicated vector calculation units can be connected in series with the multiply-add unit and the accumulation unit, or can work independently, and the intermediate result cache unit is shared by these dedicated vector calculation units.
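  • A short sketch of the benefit of placing activation directly after the accumulator (ReLU as the example, as in the text; the function shape is illustrative): each completed accumulation is rectified in the same pass instead of being parked for an external vector unit.

```c
#include <stdint.h>

static inline int32_t relu(int32_t v) { return v > 0 ? v : 0; }

/* Accumulate the partial multiply-add sums for one output element and
 * apply the activation in-line, avoiding a round trip through memory. */
void accumulate_and_activate(const int32_t *partial, int n, int32_t *out)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += partial[i];
    *out = relu(acc);
}
```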
  • the processor architecture of three instruction branches is designed, namely: general instruction branch, vector instruction branch, and AIPU branch;
  • the AIPU architecture is designed.
  • the AIPU is combined with the RISC-V architecture in the form of an accelerator. It has a dedicated register file and is configured by the RISC-V load/store instruction to accelerate convolution and matrix operations;
  • the architecture of the AIPU multiply-add operation array is designed, which is a two-dimensional parallel multiply-add operation unit.
  • The front-stage double buffer (whose purpose is to give the subsequent units a continuous supply of data) is composed of the input feature buffer, the coefficient buffer and the convolution control unit. By monitoring the remaining space in real time, data is continuously read out of the buffer while the data required for the next step is written into it.
  • The back-stage buffer realizes the functions of increasing bandwidth and data multiplexing.
  • A flexible address generator is designed: according to the register configuration, the address generator cooperates with the buffer of the following stage to complete the transformation of the data dimensions while reading the data.
  • the ping-pong operation register is designed to ensure the uninterrupted operation of the two different convolution operations before and after.
  • The architecture in the embodiments of the present application is very flexible, possessing both the control function of a general-purpose CPU and the computing power required by AI. It can be applied to edge-end artificial intelligence and IoT machines, and it can also achieve greater computing power through a network-on-chip (NoC) and be installed on a PC or server in the form of an accelerator card to realize cloud-based inference or training.
  • A system for data processing based on the RISC-V instruction set is provided, including: an acquisition module configured to acquire an instruction from the RISC-V instruction space, cache it in the cache, and determine the instruction type; a jump module configured to regenerate the instruction address in response to the instruction being a branch jump instruction, and to jump to the corresponding branch according to the instruction address; an AIPU module configured to, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data of the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and an execution module configured to perform a convolution operation according to the corresponding feature data and coefficient data, and to activate, normalize and pool the results obtained by the operation.
  • The system further includes a vector module configured to perform vector operations according to the instruction in response to jumping to the vector architecture branch.
  • The system further includes a first judgment module configured to: in response to the instruction being a load or store instruction, read the data at the address in the storage space into the destination operand according to the address in the source operand.
  • The system further includes a second judgment module configured to: judge whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enable two ports with the same bit width as the vector registers to be read and written at the same time.
  • The system further includes a configuration module configured to configure the register file in the AIPU branch into two parts, the first part running the current AIPU operation and the second part obtaining the parameters required by the AIPU for the next operation.
  • The system further includes a conversion module configured to, according to the requirements of the operation, read the data corresponding to the operation, perform dimension conversion on it, and write the converted data into the corresponding coefficient buffer or input feature buffer.
  • The system further includes a computing module configured to, in response to performing the convolution calculation, read the data in the first-level input feature cache and the first-level coefficient cache, and judge whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and, in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enable the write cache.
  • A computer device is provided, including: at least one processor; and a memory, where the memory stores computer instructions that can be executed on the processor, and the instructions are executed by the processor to implement the following steps: S1, obtain the instruction from the RISC-V instruction space and cache it in the cache, and determine the instruction type; S2, in response to the instruction being a branch jump instruction, regenerate the instruction address, and jump to the corresponding branch according to the instruction address; S3, in response to jumping to the AIPU branch, store the feature data and coefficient data for the current convolution operation through the first-level input feature cache and the first-level coefficient cache, and store the feature data and coefficient data of the next convolution operation through the second-level input feature cache and the second-level coefficient cache; and S4, perform a convolution operation according to the corresponding feature data and coefficient data, and activate, normalize and pool the result obtained by the operation.
  • the steps further comprise: in response to jumping to the vector architecture branch, performing a vector operation according to the instruction.
  • The steps further include: in response to the instruction being a load or store instruction, reading the data at the address in the storage space into the destination operand according to the address in the source operand.
  • The steps further include: judging whether the vector registers corresponding to the vector source operands are in the same group; and in response to the vector registers corresponding to the vector source operands not being in the same group, enabling two ports with the same bit width as the vector registers to be read and written at the same time.
  • The steps further include: configuring the register file in the AIPU branch into two parts, the first part running the current AIPU operation and the second part obtaining the parameters required by the AIPU for the next operation.
  • The steps further include: according to the requirements of the operation, reading the data corresponding to the operation, performing dimension transformation on it, and writing the converted data into the corresponding coefficient cache or input feature cache.
  • The steps further include: in response to performing the convolution calculation, reading the data in the first-level input feature cache and the first-level coefficient cache, and judging whether the remaining space of the first-level input feature cache and the first-level coefficient cache is larger than the size of the next group of data; and in response to the remaining space of the first-level input feature cache and the first-level coefficient cache being larger than the size of the next group of data, enabling the write cache.
  • FIG. 5 is a schematic diagram of the hardware structure of an embodiment of the above-mentioned computer device for data processing based on the RISC-V instruction set provided by this application.
  • the device includes a processor 201 and a memory 202 , and may also include an input device 203 and an output device 204 .
  • the processor 201 , the memory 202 , the input device 203 and the output device 204 may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 5 .
  • The memory 202 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the data processing method.
  • The processor 201 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 202, that is, it implements the method for data processing based on the RISC-V instruction set of the above method embodiments.
  • The memory 202 may include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created by the use of the method for data processing based on the RISC-V instruction set, etc. In addition, the memory 202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 202 may optionally include memory located remotely from the processor 201, which may be connected to local modules via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the input device 203 can receive input information such as user name and password.
  • the output device 204 may include a display device such as a display screen.
  • One or more program instructions/modules corresponding to the method for data processing based on the RISC-V instruction set are stored in the memory 202 and, when executed by the processor 201, perform the method for data processing based on the RISC-V instruction set in any of the above method embodiments.
  • Any embodiment of a computer device that executes the above-mentioned method for data processing based on a RISC-V instruction set can achieve the same or similar effects as any of the foregoing method embodiments corresponding to it.
  • the present application also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program that executes the above method when executed by a processor.
  • FIG. 6 is a schematic diagram of an embodiment of the above-mentioned computer storage medium for data processing based on the RISC-V instruction set provided by this application.
  • The computer-readable storage medium 3 stores a computer program 31 that executes the above method when executed by a processor.
  • The program of the method for data processing based on the RISC-V instruction set can be stored in a computer-readable storage medium, and when the program is executed, it may include the processes of the above-mentioned method embodiments.
  • the storage medium of the program may be a magnetic disk, an optical disk, a read only memory (ROM) or a random access memory (RAM) or the like.
  • the above computer program embodiments can achieve the same or similar effects as any of the foregoing method embodiments corresponding thereto.
  • the storage medium can be a read-only memory, a magnetic disk or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)

Abstract

The present application discloses a method and system for data processing based on a RISC-V instruction set, as well as a device and a storage medium. The method comprises: acquiring an instruction from a RISC-V instruction space, caching the instruction in a cache, and determining the type of the instruction; in response to the instruction being a branch jump instruction, regenerating an instruction address and jumping to a corresponding branch according to the instruction address; in response to jumping to an AIPU branch, storing, by means of a first-level input feature cache and a first-level coefficient cache, the feature data and coefficient data used for the current convolution operation, and storing, by means of a second-level input feature cache and a second-level coefficient cache, the feature data and coefficient data used for the next convolution operation; and performing a convolution operation according to the corresponding feature data and coefficient data, and performing activation, normalization and pooling on the result obtained by the operation. By means of the present invention, a processor architecture with three instruction branches is designed according to a RISC-V instruction set, so as to realize general control, vector operation, and accelerated convolution and matrix computing. The present invention is suitable for an edge-side AI inference chip.
PCT/CN2022/074414 2021-02-09 2022-01-27 Method and system for data processing based on a RISC-V instruction set, and device and medium WO2022170997A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110175746.6 2021-02-09
CN202110175746.6A CN112860320A (zh) 2021-02-09 2021-02-09 基于risc-v指令集进行数据处理的方法、系统、设备及介质

Publications (1)

Publication Number Publication Date
WO2022170997A1 (fr)

Family ID: 75989351

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074414 WO2022170997A1 (fr) 2021-02-09 2022-01-27 Procédé et système de traitement de données basés sur un ensemble d'instructions risc-v, et dispositif et support

Country Status (2)

Country Link
CN (1) CN112860320A (fr)
WO (1) WO2022170997A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147A (zh) * 2022-11-30 2023-03-14 珠海笛思科技有限公司 数据通信处理方法及系统

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860320A (zh) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 基于risc-v指令集进行数据处理的方法、系统、设备及介质
CN113254391B (zh) * 2021-06-25 2021-11-02 之江实验室 一种神经网络加速器卷积计算和数据载入并行方法及装置
CN113642722A (zh) * 2021-07-15 2021-11-12 深圳供电局有限公司 用于卷积计算的芯片及其控制方法、电子装置
CN114399034B (zh) * 2021-12-30 2023-05-02 北京奕斯伟计算技术股份有限公司 用于直接存储器访问装置的数据搬运方法
CN115113933B (zh) * 2022-08-25 2022-11-15 旋智电子科技(上海)有限公司 用于加速数据运算的装置
CN115248701B (zh) * 2022-09-21 2022-12-20 进迭时空(杭州)科技有限公司 一种处理器寄存器堆之间的零拷贝数据传输装置及方法
CN115576606B (zh) * 2022-11-16 2023-03-21 苏州浪潮智能科技有限公司 实现矩阵转置乘的方法、协处理器、服务器及存储介质
CN116149554B (zh) * 2023-02-08 2023-11-24 珠海妙存科技有限公司 一种基于risc-v及其扩展指令的数据存储处理系统及其方法
CN116804915B (zh) * 2023-08-28 2023-12-15 腾讯科技(深圳)有限公司 基于存储器的数据交互方法、处理器、设备以及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740749A (zh) * 2017-10-30 2019-05-10 北京深鉴智能科技有限公司 高速全连接计算的硬件实现装置与方法
CN110007961A (zh) * 2019-02-01 2019-07-12 中山大学 一种基于risc-v的边缘计算硬件架构
CN111656367A (zh) * 2017-12-04 2020-09-11 优创半导体科技有限公司 神经网络加速器的系统和体系结构
US20210011653A1 (en) * 2019-07-08 2021-01-14 Canon Kabushiki Kaisha Operation processing apparatus, operation processing method, and non-transitory computer-readable storage medium
CN112860320A (zh) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 基于risc-v指令集进行数据处理的方法、系统、设备及介质

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1516001A (zh) * 2003-01-08 2004-07-28 上海海尔集成电路有限公司 一种新型risc流水线微控制器构架及其操作方法
CN100555225C (zh) * 2008-03-17 2009-10-28 中国科学院计算技术研究所 一种支持x86虚拟机的risc处理器装置及方法
CN106940815B (zh) * 2017-02-13 2020-07-28 西安交通大学 一种可编程卷积神经网络协处理器ip核
CN108647773B (zh) * 2018-04-20 2021-07-23 复旦大学 一种可重构卷积神经网络的硬件互连系统
CN110659069B (zh) * 2018-06-28 2022-08-19 赛灵思公司 用于执行神经网络计算的指令调度方法及相应计算系统
CN111191774B (zh) * 2018-11-14 2023-04-07 上海富瀚微电子股份有限公司 面向精简卷积神经网络的低代价加速器架构及其处理方法
CN111078287B (zh) * 2019-11-08 2022-07-19 苏州浪潮智能科技有限公司 一种向量运算协处理方法与装置
CN111160545A (zh) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 人工神经网络处理系统及其数据处理方法
CN111582465B (zh) * 2020-05-08 2023-04-07 中国科学院上海高等研究院 基于fpga的卷积神经网络加速处理系统、方法以及终端
CN112130901A (zh) * 2020-09-11 2020-12-25 山东云海国创云计算装备产业创新中心有限公司 基于risc-v的协处理器、数据处理方法及存储介质
CN112232517B (zh) * 2020-09-24 2022-05-31 苏州浪潮智能科技有限公司 一种人工智能加速引擎和人工智能处理器

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740749A (zh) * 2017-10-30 2019-05-10 北京深鉴智能科技有限公司 高速全连接计算的硬件实现装置与方法
CN111656367A (zh) * 2017-12-04 2020-09-11 优创半导体科技有限公司 神经网络加速器的系统和体系结构
CN110007961A (zh) * 2019-02-01 2019-07-12 中山大学 一种基于risc-v的边缘计算硬件架构
US20210011653A1 (en) * 2019-07-08 2021-01-14 Canon Kabushiki Kaisha Operation processing apparatus, operation processing method, and non-transitory computer-readable storage medium
CN112860320A (zh) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 基于risc-v指令集进行数据处理的方法、系统、设备及介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801147A (zh) * 2022-11-30 2023-03-14 珠海笛思科技有限公司 数据通信处理方法及系统
CN115801147B (zh) * 2022-11-30 2023-09-22 珠海笛思科技有限公司 数据通信处理方法及系统

Also Published As

Publication number Publication date
CN112860320A (zh) 2021-05-28

Similar Documents

Publication Publication Date Title
WO2022170997A1 (fr) Method and system for data processing based on a RISC-V instruction set, and device and medium
CN111897579B (zh) 图像数据处理方法、装置、计算机设备和存储介质
JP6977239B2 (ja) 行列乗算器
CN109740747B (zh) 运算方法、装置及相关产品
KR102606825B1 (ko) 뉴럴 네트워크 모델을 변형하는 뉴럴 네트워크 시스템, 이를 포함하는 어플리케이션 프로세서 및 뉴럴 네트워크 시스템의 동작방법
KR20210082058A (ko) 콘볼루션 신경망들을 구현하기 위한 구성 가능형 프로세서 엘리먼트 어레이들
CN112612521A (zh) 一种用于执行矩阵乘运算的装置和方法
KR102610842B1 (ko) 뉴럴 네트워크에서의 프로세싱 엘리먼트 및 그 동작 방법
CN111105023B (zh) 数据流重构方法及可重构数据流处理器
WO2022134729A1 (fr) Procédé et système d'inférence en intelligence artificielle basés sur risc-v
CN112799726A (zh) 数据处理装置、方法及相关产品
CN111860773B (zh) 处理装置和用于信息处理的方法
Wen et al. Taso: Time and space optimization for memory-constrained DNN inference
Wang et al. SOLAR: Services-oriented deep learning architectures-deep learning as a service
CN112051981B (zh) 一种数据流水线计算路径结构及单线程数据流水线系统
Haghi et al. Flash: FPGA-accelerated smart switches with GCN case study
US20190272460A1 (en) Configurable neural network processor for machine learning workloads
Song et al. Gpnpu: Enabling efficient hardware-based direct convolution with multi-precision support in gpu tensor cores
Gottlieb et al. Clustered programmable-reconfigurable processors
de Dinechin et al. Deep learning inference on the mppa3 manycore processor
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
CN109189475A (zh) 可编程人工智能加速器指令集的构建方法
Gupta et al. Accelerating CNN inference on long vector architectures via co-design
CN114327639A (zh) 基于数据流架构的加速器、加速器的数据存取方法及设备
CN102446086A (zh) 一种可参量化专用指令集处理器设计平台

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22752158

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22752158

Country of ref document: EP

Kind code of ref document: A1